Abstract

Our aim is to construct a finite automaton recognizing the set of words 
that are at a bounded distance from some word of a given regular language. 
We define new regular operators, the similarity operators, based on a 
generalization of the notion of distance and we introduce the family of 
regular expressions extended to similarity operators, that we call AREs 
(Approximate Regular Expressions). We set formulae to compute the 
Brzozowski derivatives and the Antimirov derivatives of an ARE, which 
allows us to give a solution to the ARE membership problem and to 
provide the construction of two recognizers for the language denoted by an 
ARE. As far as we know, the family of approximate regular expressions
is introduced for the first time in this paper. Classical approximate regular
expression matching algorithms are approximate matching algorithms on 
regular expressions. Our approach is rather to process an exact matching 
on approximate regular expressions. 



1 Introduction 

This paper addresses the problem of constructing a finite automaton that rec- 
ognizes the language of all the words that are at a distance less than or equal 
to a given positive integer k from some word of a given regular language. Our 
approach is based on the extension of regular expressions to approximate regular
expressions (AREs) that handle distance operators. More precisely, we first
define a new family of operators: given an integer k, the operator $\mathbb{F}_k$ is such
that, for any regular language L, the language $\mathbb{F}_k(L)$ is the set of all the words
that are at a distance less than or equal to k from some word of L. We then
consider the family of approximate regular expressions obtained from the family
of regular expressions by adding the family of $\mathbb{F}_k$ operators to the set of regular
operators. We provide a formula that, given a regular language L, computes the
quotient of the language $\mathbb{F}_k(L)$ with respect to a symbol. We finally extend the
computation of Brzozowski derivatives Q (resp. of Antimirov derivatives [1]) to
the family of approximate regular expressions. The first benefit of the derivation 
of an ARE is that it yields an elegant solution for the approximate member- 
ship problem. Moreover, the set of Brzozowski derivatives (resp. of Antimirov 
derivatives) of an ARE is shown to be finite. As a consequence, the derivation 
of an ARE enables the computation of a finite automaton that recognizes the 
language of this ARE. 






The similarity between two words is generally measured by a distance and 
two basic types of distance called Hamming distance and Levenshtein distance 
(or edit distance) are generally considered. In our constructions the similarity 
between two words is handled by a word comparison function, that is more 
general than a distance (for instance, a comparison function is not necessarily 
symmetrical). It is the reason why we will speak of similarity operators rather 
than of distance operators. 

The aim of this paper is to investigate the properties of the AREs family, in 
particular to define formulae for computing the set of (Brzozowski or Antimirov) 
derivatives of an ARE and to check the properties of this set. This theoretical 
study leads to a solution for the approximate membership problem as well as to 
a solution for the approximate regular expression matching problem (based on 
the automaton associated with the set of derivatives of an ARE). However, this 
paper is not an algorithmic contribution to the approximate regular expression 
matching problem: it investigates new automaton-theoretic constructions that 
hopefully make a sound foundation for the design of new approximate matching 
algorithms, but it does not present new efficient algorithms. 

Let us recall that approximate matching consists in locating the segments 
of the text that approximately correspond to the pattern to be matched, i.e. 
segments that do not present too many errors with respect to the pattern. 
This research topic has numerous applications, in biology or in linguistics for 
example, and many algorithms have been designed in this framework for more 
than thirty years especially concerning approximate string matching (see [H, 
for a survey of such algorithms). Two contexts can be distinguished: in the 
off-line case, that is when a pre-computing of the text is performed, the basic 
tool is the construction of indexes [9]; otherwise, the basic technique is dynamic
programming [12]. In both cases, automata constructions have been used, either
to represent an index [18], [2] or to simulate dynamic programming Q.

Several studies address the problem of constructing a finite automaton that 
recognizes the language of all the words that are at a distance less than or equal 
to a given positive integer k from a given word. For instance this problem is 
considered in Q where Hamming distance is used and in [17] where Levenshtein
distance is used. A challenging problem is to tackle the more general case where
the pattern is no longer a word but a regular expression (TBI Il9|. The solution
described in [11] first computes k+1 clones of some non-deterministic automaton
recognizing the language of the regular expression and then interconnects these 
clones by a set of transitions that depends on the type of distance. 

As far as we know, the family of approximate regular expressions is intro- 
duced for the first time in this paper. Approximate regular expression matching 
algorithms described in the papers above-cited are approximate matching al- 
gorithms on regular expressions. Our approach is rather to process an exact 
matching on approximate regular expressions. 

This paper is an extended version of Q. Classical notions of language theory,
such as derivative computation, are recalled in Section 2. Section 3 gives a
formalization of the notion of word comparison function and provides a definition
of the family of approximate regular expressions. The usual case of Hamming
and Levenshtein distances is addressed in Section 4. Finally, Section 5 is devoted
to the general case and derivative-based constructions of an automaton from an
approximate regular expression are described.



2 Preliminaries 

Given a set X, we denote by Card(X) the number of elements in X.
A finite automaton A is a 5-tuple $(\Sigma, Q, I, F, \delta)$ with:

• $\Sigma$ the alphabet (a finite set of symbols),

• $Q$ a finite set of states,

• $I \subseteq Q$ the set of initial states,

• $F \subseteq Q$ the set of final states,

• $\delta \subseteq Q \times \Sigma \times Q$ the set of transitions.

The set $\delta$ is equivalent to the function from $Q \times \Sigma$ to $2^Q$ defined by:
$q' \in \delta(q, a)$ if and only if $(q, a, q') \in \delta$. The domain of the function
$\delta$ is extended to $2^Q \times \Sigma^*$ as follows: $\forall P \subseteq Q$, $\forall a \in \Sigma$,
$\forall w \in \Sigma^*$, $\delta(P, \varepsilon) = P$, $\delta(P, a) = \bigcup_{p \in P} \delta(p, a)$ and
$\delta(P, a \cdot w) = \delta(\delta(P, a), w)$. The automaton A recognizes the language
$L(A) = \{w \in \Sigma^* \mid \delta(I, w) \cap F \neq \emptyset\}$. The automaton A is deterministic
if $\mathrm{Card}(I) = 1$ and $\forall (q, a) \in Q \times \Sigma$, $\mathrm{Card}(\delta(q, a)) \leq 1$.

A regular expression E over an alphabet $\Sigma$ is inductively defined by:

$E = \emptyset$, $E = \varepsilon$, $E = a$,
$E = (F + G)$, $E = (F \cdot G)$, $E = (F^*)$

where a is any symbol in $\Sigma$ and F and G are any two regular expressions.

The language L(E) denoted by E is inductively defined by:
$L(\emptyset) = \emptyset$, $L(a) = \{a\}$, $L(\varepsilon) = \{\varepsilon\}$,

$L(F + G) = L(F) \cup L(G)$, $L(F \cdot G) = L(F) \cdot L(G)$ and $L(F^*) = (L(F))^*$

where a is any symbol in $\Sigma$, F and G are any two regular expressions, and
for any $L_1, L_2 \subseteq \Sigma^*$,

$L_1 \cup L_2 = \{w \mid w \in L_1 \lor w \in L_2\}$,
$L_1 \cdot L_2 = \{w_1 w_2 \mid w_1 \in L_1 \land w_2 \in L_2\}$
and $L_1^* = \{w_1 \cdots w_k \mid k \geq 1 \land \forall j \in \{1, \ldots, k\}, w_j \in L_1\} \cup \{\varepsilon\}$.
A language L is regular if there exists a regular expression E such that
L(E) = L. It has been proved by Kleene [10] that a language is regular if and
only if it is recognized by a finite automaton.

Given a language L over an alphabet $\Sigma$ and a word w in $\Sigma^*$, the membership
problem is to determine whether w belongs to L. It can be solved by the
computation of the boolean r(w, L) defined by:

$$r(w, L) = \begin{cases} 1 & \text{if } w \in L, \\ 0 & \text{otherwise.} \end{cases}$$

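The quotient of L w.r.t. a symbol a is the language $a^{-1}(L) = \{w \in \Sigma^* \mid aw \in L\}$.
It can be recursively computed as follows, where b is any symbol distinct from a:

$a^{-1}(\emptyset) = a^{-1}(\{\varepsilon\}) = a^{-1}(\{b\}) = \emptyset$, $a^{-1}(\{a\}) = \{\varepsilon\}$,

$a^{-1}(L_1 \cup L_2) = a^{-1}(L_1) \cup a^{-1}(L_2)$, $a^{-1}(L_1^*) = a^{-1}(L_1) \cdot L_1^*$,

$$a^{-1}(L_1 \cdot L_2) = \begin{cases} a^{-1}(L_1) \cdot L_2 \cup a^{-1}(L_2) & \text{if } r(\varepsilon, L_1) = 1, \\ a^{-1}(L_1) \cdot L_2 & \text{otherwise.} \end{cases}$$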



The quotient $w^{-1}(L)$ of L w.r.t. a word w in $\Sigma^*$ is the set
$\{w' \in \Sigma^* \mid w \cdot w' \in L\}$. It can be recursively computed as follows:
$\varepsilon^{-1}(L) = L$, $(aw')^{-1}(L) = w'^{-1}(a^{-1}(L))$ with $a \in \Sigma$ and
$w' \in \Sigma^+$. The Myhill-Nerode Theorem (lj [l^ states that a language L is
regular if and only if the set of quotients $\{u^{-1}(L) \mid u \in \Sigma^*\}$ is finite.

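Since $r(w, L) = r(\varepsilon, w^{-1}(L))$, the membership problem can be solved using
the quotient formulae and the following straightforward computation of $r(\varepsilon, L)$:
$r(\varepsilon, \{a\}) = r(\varepsilon, \emptyset) = 0$, $r(\varepsilon, \{\varepsilon\}) = 1$,
$r(\varepsilon, L_1 \cup L_2) = r(\varepsilon, L_1) \lor r(\varepsilon, L_2)$, $r(\varepsilon, L_1 \cdot L_2) = r(\varepsilon, L_1) \land r(\varepsilon, L_2)$,

$r(\varepsilon, L_1^*) = 1$.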

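The notion of derivative of an expression has been introduced by Brzozowski Q.
The derivative of an expression E w.r.t. a word w is an expression
denoting the quotient of L(E) w.r.t. w. Let E be a regular expression over an
alphabet $\Sigma$ and let a and b be two distinct symbols of $\Sigma$. The derivative of E
w.r.t. a is the expression $\frac{d}{da}(E)$ inductively computed as follows:

$\frac{d}{da}(\emptyset) = \frac{d}{da}(\varepsilon) = \frac{d}{da}(b) = \emptyset$, $\frac{d}{da}(a) = \varepsilon$,

$\frac{d}{da}(F^*) = \frac{d}{da}(F) \cdot F^*$, $\frac{d}{da}(F + G) = \frac{d}{da}(F) + \frac{d}{da}(G)$,

$$\frac{d}{da}(F \cdot G) = \begin{cases} \frac{d}{da}(F) \cdot G + \frac{d}{da}(G) & \text{if } r(\varepsilon, L(F)) = 1, \\ \frac{d}{da}(F) \cdot G & \text{otherwise.} \end{cases}$$

The derivative of E is extended to words of $\Sigma^*$ as follows:

$\frac{d}{d\varepsilon}(E) = E$, $\frac{d}{daw}(E) = \frac{d}{dw}\left(\frac{d}{da}(E)\right)$.

Since $w^{-1}(L(E)) = L(\frac{d}{dw}(E))$, it holds $r(w, L(E)) = r(\varepsilon, L(\frac{d}{dw}(E)))$. For
convenience, we set $r(w, E) = r(w, L(E))$. Notice that the boolean $r(\varepsilon, E)$ can
be inductively computed as follows:

$r(\varepsilon, a) = r(\varepsilon, \emptyset) = 0$, $r(\varepsilon, \varepsilon) = 1$,
$r(\varepsilon, E_1 + E_2) = r(\varepsilon, E_1) \lor r(\varepsilon, E_2)$, $r(\varepsilon, E_1 \cdot E_2) = r(\varepsilon, E_1) \land r(\varepsilon, E_2)$,

$r(\varepsilon, E_1^*) = 1$.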

As a consequence, derivation provides a syntactical solution for the mem- 
bership problem. 
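
As an illustration of this syntactical solution, the derivation rules and the
computation of r(ε, E) recalled above can be sketched in Python as follows.
This is only a minimal sketch: the class names (Empty, Eps, Sym, Sum, Cat,
Star) and the function names (r_eps, deriv, member) are hypothetical and are
not part of the paper, and no ACI reduction is performed on the derivatives.

```python
from dataclasses import dataclass

# Regular expressions as immutable trees (all names are illustrative).
@dataclass(frozen=True)
class Empty: pass          # the empty language

@dataclass(frozen=True)
class Eps: pass            # the empty word

@dataclass(frozen=True)
class Sym:
    a: str                 # a single symbol of the alphabet

@dataclass(frozen=True)
class Sum:
    f: object
    g: object

@dataclass(frozen=True)
class Cat:
    f: object
    g: object

@dataclass(frozen=True)
class Star:
    f: object

def r_eps(e):
    """r(eps, E): True iff the empty word belongs to L(E)."""
    if isinstance(e, (Empty, Sym)):
        return False
    if isinstance(e, (Eps, Star)):
        return True
    if isinstance(e, Sum):
        return r_eps(e.f) or r_eps(e.g)
    return r_eps(e.f) and r_eps(e.g)          # Cat

def deriv(a, e):
    """Derivative of the expression e w.r.t. the symbol a."""
    if isinstance(e, (Empty, Eps)):
        return Empty()
    if isinstance(e, Sym):
        return Eps() if e.a == a else Empty()
    if isinstance(e, Sum):
        return Sum(deriv(a, e.f), deriv(a, e.g))
    if isinstance(e, Star):
        return Cat(deriv(a, e.f), e)
    left = Cat(deriv(a, e.f), e.g)            # Cat
    return Sum(left, deriv(a, e.g)) if r_eps(e.f) else left

def member(w, e):
    """r(w, E): derive symbol by symbol, then test the empty word."""
    for a in w:
        e = deriv(a, e)
    return r_eps(e)

# The expression b*(a+b)c*, which is also used in Example 3.
F = Cat(Cat(Star(Sym("b")), Sum(Sym("a"), Sym("b"))), Star(Sym("c")))
```

With this encoding, `member("bbacc", F)` holds whereas `member("ca", F)` does
not; the derivatives grow syntactically because no reduction is applied.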

Notice that the set $\mathcal{D}_E$ of derivatives of an expression E is not necessarily
finite. It has been proved by Brzozowski Q that it is sufficient to use the
ACI equivalence (that is based on the associativity, the commutativity and the
idempotence of the sum of expressions) to obtain a finite set of derivatives: the
set $\mathcal{D}'_E$ of dissimilar derivatives. Given a class of ACI-equivalent expressions, a
unique representative can be obtained after deleting parentheses (associativity),
ordering terms of each sum (commutativity) and deleting redundant subexpressions
(idempotence). Let $E_{\sim_s}$ be the unique representative of the class of the
expression E. The set of dissimilar derivatives can be computed as follows:

$\frac{d}{da}(\emptyset) = \frac{d}{da}(\varepsilon) = \frac{d}{da}(b) = \emptyset$, $\frac{d}{da}(a) = \varepsilon$,
$\frac{d}{da}(F + G) = \left(\frac{d}{da}(F) + \frac{d}{da}(G)\right)_{\sim_s}$, $\frac{d}{da}(F^*) = \frac{d}{da}(F) \cdot F^*$,

$$\frac{d}{da}(F \cdot G) = \begin{cases} \left(\frac{d}{da}(F) \cdot G + \frac{d}{da}(G)\right)_{\sim_s} & \text{if } r(\varepsilon, F) = 1, \\ \left(\frac{d}{da}(F) \cdot G\right)_{\sim_s} & \text{otherwise.} \end{cases}$$
The dissimilar derivative finite automaton $B'(E) = (\Sigma, Q, \{q_0\}, F, \delta)$ of a regular
expression E over an alphabet $\Sigma$ is defined by:

• $Q = \mathcal{D}'_E$,

• $q_0 = E$,

• $F = \{q \in Q \mid \varepsilon \in L(q)\}$,

• $\delta = \{(q, a, q') \in Q \times \Sigma \times Q \mid \frac{d}{da}(q) = q'\}$.

The automaton B'(E) is deterministic and it recognizes the language L(E). Its
size can be exponentially larger than the number of symbols of E.

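Antimirov's algorithm [1] constructs a finite automaton from a regular expression
E. It is based on the partial derivative computation. The partial
derivative of a regular expression E w.r.t. a symbol a is the set $\frac{\partial}{\partial a}(E)$ of
expressions defined as follows:

$\frac{\partial}{\partial a}(\emptyset) = \frac{\partial}{\partial a}(\varepsilon) = \frac{\partial}{\partial a}(b) = \emptyset$, $\frac{\partial}{\partial a}(a) = \{\varepsilon\}$,
$\frac{\partial}{\partial a}(F + G) = \frac{\partial}{\partial a}(F) \cup \frac{\partial}{\partial a}(G)$, $\frac{\partial}{\partial a}(F^*) = \frac{\partial}{\partial a}(F) \cdot F^*$,

$$\frac{\partial}{\partial a}(F \cdot G) = \begin{cases} \frac{\partial}{\partial a}(F) \cdot G \cup \frac{\partial}{\partial a}(G) & \text{if } r(\varepsilon, F) = 1, \\ \frac{\partial}{\partial a}(F) \cdot G & \text{otherwise,} \end{cases}$$

with for any set $\mathcal{E}$ of expressions, $\mathcal{E} \cdot F = \bigcup_{E \in \mathcal{E}} \{E \cdot F\}$.
The partial derivative of E is extended to words of $\Sigma^*$ as follows:
$\frac{\partial}{\partial \varepsilon}(E) = \{E\}$, $\frac{\partial}{\partial aw}(E) = \frac{\partial}{\partial w}\left(\frac{\partial}{\partial a}(E)\right)$,

with for a set $\mathcal{E}$ of expressions, $\frac{\partial}{\partial w}(\mathcal{E}) = \bigcup_{E \in \mathcal{E}} \frac{\partial}{\partial w}(E)$. Every element of the
partial derivative of E w.r.t. a word w in $\Sigma^*$ is called a derivated term of E
w.r.t. w. The set of the derivated terms of E is the union of the sets of the
derivated terms of E w.r.t. w, for all w in $\Sigma^*$. Antimirov [1] has shown that the
set $\mathcal{DT}_E$ of the derivated terms of E is such that $\mathrm{Card}(\mathcal{DT}_E) \leq n + 1$, where n
is the number of symbols of E.

Furthermore, for any word w in $\Sigma^*$, $\bigcup_{E' \in \frac{\partial}{\partial w}(E)} L(E') = w^{-1}(L(E))$.
Consequently, the partial derivation provides another syntactical solution for the
membership problem as well as a finite automaton computation. Indeed, it can
be shown that $r(w, E) = \bigvee_{E' \in \frac{\partial}{\partial w}(E)} r(\varepsilon, E')$.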

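The derivated term finite automaton $A(E) = (\Sigma, Q, \{q_0\}, F, \delta)$ of a regular
expression E is defined as follows:

• $Q = \mathcal{DT}_E$,

• $q_0 = E$,

• $F = \{q \in Q \mid r(\varepsilon, q) = 1\}$,

• $\delta = \{(q, a, q') \in Q \times \Sigma \times Q \mid q' \in \frac{\partial}{\partial a}(q)\}$.

The automaton A(E) recognizes the language L(E).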
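
The construction of the derivated term automaton can be sketched in Python as
follows. This is a minimal illustration under stated assumptions: expressions
are encoded as nested tuples, the function names (nullable, pderiv,
derived_term_automaton, accepts) are hypothetical, and the only reduction
applied is ε · F = F.

```python
def nullable(e):
    """r(eps, E) for a regular expression encoded as a nested tuple."""
    tag = e[0]
    if tag in ("empty", "sym"):
        return False
    if tag in ("eps", "star"):
        return True
    if tag == "sum":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])       # cat

def cat(e, f):
    """E . F with the reduction eps . F = F."""
    return f if e == ("eps",) else ("cat", e, f)

def pderiv(a, e):
    """Partial derivative of e w.r.t. the symbol a: a set of expressions."""
    tag = e[0]
    if tag in ("empty", "eps"):
        return frozenset()
    if tag == "sym":
        return frozenset({("eps",)}) if e[1] == a else frozenset()
    if tag == "sum":
        return pderiv(a, e[1]) | pderiv(a, e[2])
    if tag == "star":
        return frozenset(cat(d, e) for d in pderiv(a, e[1]))
    left = frozenset(cat(d, e[2]) for d in pderiv(a, e[1]))   # cat
    return left | pderiv(a, e[2]) if nullable(e[1]) else left

def derived_term_automaton(e, alphabet):
    """Explore the derivated terms reachable from e; return (Q, delta, F)."""
    states, todo, delta = {e}, [e], {}
    while todo:
        q = todo.pop()
        for a in alphabet:
            delta[(q, a)] = pderiv(a, q)
            for q2 in delta[(q, a)] - states:
                states.add(q2)
                todo.append(q2)
    finals = {q for q in states if nullable(q)}
    return states, delta, finals

def accepts(e, alphabet, w):
    """Run the (non-deterministic) derivated term automaton of e on w."""
    states, delta, finals = derived_term_automaton(e, alphabet)
    current = {e}
    for a in w:
        current = set().union(*(delta[(q, a)] for q in current))
    return bool(current & finals)

# The expression b*(a+b)c*, which has n = 4 symbols.
b_star = ("star", ("sym", "b"))
a_or_b = ("sum", ("sym", "a"), ("sym", "b"))
F = ("cat", ("cat", b_star, a_or_b), ("star", ("sym", "c")))
```

For this expression the exploration yields two derivated terms only, which is
consistent with the bound n + 1 = 5 recalled above.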

In this paper, we consider the approximate membership problem that is defined
as follows:

Given a regular expression E over an alphabet $\Sigma$, a word w in $\Sigma^*$, a function
$\mathbb{F}$ from $\Sigma^* \times \Sigma^*$ to $\mathbb{N}$ and an integer k, is there a word w' in L(E) satisfying
$\mathbb{F}(w, w') \leq k$?

In the following, we provide a syntactical solution for the approximate
membership problem in the case where the function $\mathbb{F}$ satisfies specific properties.



3 Comparison Functions: Symbols, Sequences and Words

Let $\Sigma$ be an alphabet, $S = \Sigma \cup \{\varepsilon\}$ and X be a subset of $S \times S$. A cost function
C over X is a function from X to $\mathbb{N}$ satisfying Condition 1: for all $\alpha$ in S,
$C(\alpha, \alpha) = 0$. For any pair $(\alpha, \beta)$ in $S \times S$ such that $C(\alpha, \beta)$ is not defined, let us
set $C(\alpha, \beta) = \bot$. Consequently, a cost function can be viewed as a function from
$S \times S$ to $\mathbb{N} \cup \{\bot\}$ satisfying Condition 1. Since we use $\bot$ to deal with undefined
computation, we set for all x in $\mathbb{N} \cup \{\bot\}$, $\bot + x = x + \bot = x - \bot = \bot - x = \bot$
and for all integers x, y in $\mathbb{N}$, $x - y = \bot$ when y > x. A cost function can be
represented by a directed and labelled graph $C = (S, V)$ where V is a subset of
$S \times (\mathbb{N} \cup \{\bot\}) \times S$ such that for all $(\alpha, \beta)$ in $S \times S$, $C(\alpha, \beta) = k \Leftrightarrow (\alpha, k, \beta) \in V$.
Transitions labelled by $\bot$ can be omitted in the graphical representation, as well
as the implicit transitions $(\alpha, 0, \alpha)$ (see Example 1).

Example 1. Let $\Sigma = \{a, b, c\}$. Let C be the cost function defined as follows:

$$C(x, y) = \begin{cases} 0 & \text{if } x = y, \\ 4 & \text{if } x = a \land y = c, \\ 3 & \text{if } x = c \land y = a, \\ 1 & \text{if } x \in \{a, c\} \land y = b, \\ \bot & \text{otherwise.} \end{cases}$$

Figure 1: The cost function C

The cost function C can be represented by the graph in Figure 1.

Given a positive integer k we now consider the set $S^k$ of all the sequences
$s = (s_1, \ldots, s_k)$ of size k made of elements of S. A sequence comparison function
is a function $\mathcal{F}$ from $\bigcup_{k \in \mathbb{N}} S^k \times S^k$ to $\mathbb{N} \cup \{\bot\}$. Given a pair (s, s') of sequences
with the same size, $\mathcal{F}(s, s')$ either is an integer or is undefined. In the following
we will consider sequence comparison functions $\mathcal{F}$ satisfying Condition 2: $\mathcal{F}$
is defined from a given cost function C over $S \times S$, and Condition 3: $\mathcal{F}$ is a
symbol-wise comparison function, that is, for any two sequences $s = (s_1, \ldots, s_n)$
and $s' = (s'_1, \ldots, s'_n)$, it holds:

$$\mathcal{F}(s, s') = \mathcal{F}((s_1), (s'_1)) + \mathcal{F}((s_2, \ldots, s_n), (s'_2, \ldots, s'_n)) = \sum_{k \in \{1, \ldots, n\}} \mathcal{F}((s_k), (s'_k)).$$

We consider that those functions satisfy Condition 1, i.e. for all $\alpha$ in S,
$\mathcal{F}((\alpha), (\alpha)) = 0$. Consequently, for any pair of sequences $s = (s_1, \ldots, s_k)$ and
$s' = (s'_1, \ldots, s'_k)$ such that k > 1, Condition 4 is satisfied: if there exists an
integer k' in $\{1, \ldots, k\}$ such that $s_{k'} = s'_{k'} = \varepsilon$, then:

$$\mathcal{F}(s, s') = \begin{cases} \mathcal{F}((s_2, \ldots, s_k), (s'_2, \ldots, s'_k)) & \text{if } k' = 1, \\ \mathcal{F}((s_1, \ldots, s_{k-1}), (s'_1, \ldots, s'_{k-1})) & \text{if } k' = k, \\ \mathcal{F}((s_1, \ldots, s_{k'-1}, s_{k'+1}, \ldots, s_k), (s'_1, \ldots, s'_{k'-1}, s'_{k'+1}, \ldots, s'_k)) & \text{otherwise.} \end{cases}$$

As a consequence of Condition 3, a symbol-wise sequence comparison function is
defined by the images of the pairs of sequences of size 1. Notice that a sequence
comparison function is not necessarily symbol-wise, e.g. for a given cost function
F, $\mathcal{F}((s_1, \ldots, s_n), (s'_1, \ldots, s'_n)) = \sum_{k \in \{1, \ldots, n\}} \mathrm{F}(s_k, s'_k)^k$.

Two of the most well-known symbol-wise sequence comparison functions are
the Hamming one ($\mathcal{H}$) and the Levenshtein one ($\mathcal{L}$) respectively defined for any
integer n > 0 and for any pair of sequences $s = (s_1, \ldots, s_n)$ and $s' = (s'_1, \ldots, s'_n)$
in $S^n \times S^n$ by:

$\mathcal{H}(s, s') = \sum_{k \in \{1, \ldots, n\}} \mathrm{H}(s_k, s'_k)$, $\mathcal{L}(s, s') = \sum_{k \in \{1, \ldots, n\}} \mathrm{L}(s_k, s'_k)$,

with H and L the two cost functions respectively defined for all a, b in $\Sigma \cup \{\varepsilon\}$
by:

$$\mathrm{H}(a, b) = \begin{cases} \bot & \text{if } (a = \varepsilon \lor b = \varepsilon) \land (a, b) \neq (\varepsilon, \varepsilon), \\ 1 & \text{if } a \neq b, \\ 0 & \text{otherwise,} \end{cases} \quad \text{and} \quad \mathrm{L}(a, b) = \begin{cases} 1 & \text{if } a \neq b, \\ 0 & \text{otherwise.} \end{cases}$$

Let us now explain how a word comparison function can be deduced from
a sequence comparison function. Let w be a word in $\Sigma^*$ and |w| be its length.
The sequence $s = (s_1, \ldots, s_n)$ in $S^n$ is said to be a split-up of w if $s_1 \cdots s_n = w$.
The integer n is the size of s. The set of all the split-ups of size k of a word
w is denoted by $\mathrm{Split}_k(w)$ and the set of all the split-ups of w is denoted by
$\mathrm{Split}(w)$.

Let $\mathcal{F}$ be a sequence comparison function, (u, v) be a pair of words of $\Sigma^*$,
and m be a positive integer. We consider the following sets:

$Y(u, v) = \{\mathcal{F}(u', v') \mid \exists k \in \mathbb{N}, k \geq 1 \land (u', v') \in \mathrm{Split}_k(u) \times \mathrm{Split}_k(v)\} \cap \mathbb{N}$,
$Y_m(u, v) = \{\mathcal{F}(u', v') \mid \exists k \in \mathbb{N}, 1 \leq k \leq m \land (u', v') \in \mathrm{Split}_k(u) \times \mathrm{Split}_k(v)\} \cap \mathbb{N}$.

Definition 1. Let $\mathcal{F}$ be a sequence comparison function. The word comparison
function associated with $\mathcal{F}$ is the function $\mathbb{F}$ from $\Sigma^* \times \Sigma^*$ to $\mathbb{N} \cup \{\bot\}$ defined
by:

$\mathbb{F}(u, v) = \min(Y(u, v))$ if $Y(u, v) \neq \emptyset$, $\mathbb{F}(u, v) = \bot$ otherwise.

Notice that a word comparison function is not necessarily symmetrical. Indeed,
some problems can be modelled with a non-symmetrical function. For
instance, given two words w and w', can w be obtained from w' by deleting some
letters, i.e. is w a subword of w'? Such a problem can be modelled by the
word comparison function $\mathbb{D}$ associated with the symbol-wise comparison function
$\mathcal{D}$ defined for any pair of sequences of length 1 by:

$$\forall (\alpha, \beta) \in (\Sigma \cup \{\varepsilon\})^2, \quad \mathcal{D}((\alpha), (\beta)) = \begin{cases} 0 & \text{if } \alpha = \beta, \\ 1 & \text{if } \alpha = \varepsilon \land \beta \in \Sigma, \\ \bot & \text{otherwise.} \end{cases}$$

It can be shown that for any two words w and w' in $\Sigma^*$:

$$\mathbb{D}(w, w') = \begin{cases} \bot & \text{if } w \text{ is not a subword of } w', \\ |w'| - |w| & \text{otherwise.} \end{cases}$$
In the case of a sequence comparison function based on a cost function, the
whole set $\mathbb{N}$ need not be considered. Indeed, according to Condition 4, if
$u \neq \varepsilon$ or $v \neq \varepsilon$, then $Y(u, v) = Y_{|u|+|v|}(u, v)$ and we can write:

$$\mathbb{F}(u, v) = \begin{cases} 0 & \text{if } u = v = \varepsilon, \\ \min(Y_{|u|+|v|}(u, v)) & \text{if } (u, v) \neq (\varepsilon, \varepsilon) \land Y_{|u|+|v|}(u, v) \neq \emptyset, \\ \bot & \text{otherwise.} \end{cases}$$

The Hamming distance $\mathbb{H}$ and the Levenshtein distance $\mathbb{L}$ are the word comparison
functions respectively associated with the sequence comparison functions
$\mathcal{H}$ and $\mathcal{L}$. Both of them satisfy the properties of word distances (see footnote 1).
Notice that in the following we will handle word comparison functions that are not
necessarily distances (see Example 1 for the definition of a nonsymmetrical cost function).
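
For a symbol-wise function whose costs are non-negative, the minimum over all
pairs of split-ups can be obtained by the classical dynamic programming scheme
over prefixes, since pairs (ε, ε) cost 0 and can be dropped (Condition 4). The
following Python sketch illustrates this. It is only an illustration: the
function names are hypothetical, the empty string "" encodes ε, None encodes ⊥,
and the parameter `cost` is assumed to give the image of a pair of sequences
of size 1.

```python
def word_compare(u, v, cost):
    """Minimum total cost over all alignments (pairs of split-ups of the
    same size) of u and v; None stands for the undefined value."""
    def add(x, y):
        return None if x is None or y is None else x + y

    def best(*xs):
        xs = [x for x in xs if x is not None]
        return min(xs) if xs else None

    n, m = len(u), len(v)
    d = [[None] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            d[i][j] = best(
                # pair (u_i, v_j)
                add(d[i - 1][j - 1], cost(u[i - 1], v[j - 1])) if i and j else None,
                # pair (u_i, eps)
                add(d[i - 1][j], cost(u[i - 1], "")) if i else None,
                # pair (eps, v_j)
                add(d[i][j - 1], cost("", v[j - 1])) if j else None,
            )
    return d[n][m]

def hamming_cost(x, y):
    if x == y:
        return 0
    return None if x == "" or y == "" else 1

def levenshtein_cost(x, y):
    return 0 if x == y else 1

def subword_cost(x, y):
    """Non-symmetrical cost: only pairs (eps, symbol) are allowed, at cost 1."""
    if x == y:
        return 0
    return 1 if x == "" else None
```

For instance, `word_compare("ac", "abc", subword_cost)` yields 1 since ac is a
subword of abc, whereas `word_compare("abc", "ac", subword_cost)` is undefined,
which illustrates that a word comparison function need not be symmetrical.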

Example 2. Let C be the cost function defined in Example 1. Let $s = (s_1)$
and $s' = (s'_1)$ be two sequences of size 1. We define four symbol-wise sequence
comparison functions by setting the images of the pairs of sequences of size 1
from the cost function C:

$\overrightarrow{C}(s, s') = C(s_1, s'_1)$, $\overleftrightarrow{C}(s, s') = \min\{C(s_1, s'_1), C(s'_1, s_1)\}$,

$\overleftarrow{C}(s, s') = C(s'_1, s_1)$, $\widehat{C}(s, s') = \min_{x \in \Sigma \cup \{\varepsilon\}} \{C(s_1, x) + C(s'_1, x)\}$.

Let us consider the two split-ups s = (a, c, a) and s' = (c, a, c). According to
Figure 2, it holds:

$\overrightarrow{C}(s, s') = 11$,
$\overleftarrow{C}(s, s') = 10$,
$\overleftrightarrow{C}(s, s') = 9$,
$\widehat{C}(s, s') = 6$.

Figure 2: Examples of sequence comparisons (position-wise costs between
s = (a, c, a) and s' = (c, a, c): 4, 3, 4 for the first function; 3, 4, 3 for the
second; 3, 3, 3 for the third; 1 + 1 at each position, through the symbol b,
for the fourth)

Any word comparison function can be used as a language operator in order
to compute the set of words that are at a bounded distance from some word of
a given language.

Definition 2. Let L be a language over an alphabet $\Sigma$, $\mathbb{F}$ a word comparison
function and k an integer in $\mathbb{N} \cup \{\bot\}$. Then:

$$\mathbb{F}_k(L) = \begin{cases} \{w \in \Sigma^* \mid \exists u \in L, \mathbb{F}(w, u) \in \{0, \ldots, k\}\} & \text{if } k \in \mathbb{N}, \\ \emptyset & \text{otherwise.} \end{cases}$$

The operator $\mathbb{F}_k$ is called a similarity operator. Let us notice that $\mathbb{F}_k(\mathbb{F}_{k'}(L))$
is not necessarily equal to $\mathbb{F}_{k+k'}(L)$. Indeed, let us consider the three languages
$L_1 = \mathbb{F}_1(\{a\})$, $L_2 = \mathbb{F}_1(\mathbb{F}_1(\{a\}))$ and $L_3 = \mathbb{F}_2(\{a\})$ over the alphabet $\Sigma = \{a, b\}$
with $\mathbb{F}$ the word comparison function associated with the symbol-wise
sequence comparison function $\mathcal{F}$ defined for any symbols $\alpha$, $\beta$ by $\mathcal{F}((\alpha), (\beta)) = 0$
if $\alpha = \beta$, $\mathcal{F}((\alpha), (\beta)) = 2$ otherwise. Then $L_1 = L_2 = \{a\}$ whereas
$L_3 = \{\varepsilon, a, b, aa, ab, ba\}$.

1 A word distance $\mathbb{D}$ is a word comparison function satisfying the three following properties
for all $x, y, z \in \Sigma^*$: (1) $\mathbb{D}(x, y) = 0 \Rightarrow x = y$, (2) $\mathbb{D}(x, y) = \mathbb{D}(y, x)$,
(3) $\mathbb{D}(x, y) + \mathbb{D}(y, z) \geq \mathbb{D}(x, z)$.
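
The three languages above can be checked by brute force. The Python sketch
below is only an illustration under stated assumptions: the names are
hypothetical, the cost function is total (0 on equal symbols, 2 otherwise, ""
encoding ε), and candidate words are enumerated up to a length bound that is
assumed to be large enough (here every inserted symbol costs 2, so words of
length greater than 2 cannot belong to the languages considered).

```python
from itertools import product

def cost(x, y):
    """Cost of the pair (x, y): 0 if the symbols are equal, 2 otherwise."""
    return 0 if x == y else 2

def word_compare(u, v):
    """Minimum alignment cost between u and v for the total cost above."""
    d = [[0] * (len(v) + 1) for _ in range(len(u) + 1)]
    for i in range(len(u) + 1):
        for j in range(len(v) + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i and j:
                options.append(d[i - 1][j - 1] + cost(u[i - 1], v[j - 1]))
            if i:
                options.append(d[i - 1][j] + cost(u[i - 1], ""))
            if j:
                options.append(d[i][j - 1] + cost("", v[j - 1]))
            d[i][j] = min(options)
    return d[len(u)][len(v)]

def similar(lang, k, alphabet, max_len):
    """Words of length at most max_len that are at cost at most k
    from some word of the finite language lang."""
    words = ("".join(p) for n in range(max_len + 1)
             for p in product(alphabet, repeat=n))
    return {w for w in words if any(word_compare(w, u) <= k for u in lang)}

L1 = similar({"a"}, 1, "ab", 3)
L2 = similar(L1, 1, "ab", 3)
L3 = similar({"a"}, 2, "ab", 3)
```

Under these assumptions, L1 and L2 both reduce to {a}, whereas L3 contains
six words, the empty word being encoded by "".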

Definition 3. An approximate regular expression (ARE, see footnote 2) E over an alphabet
$\Sigma$ is inductively defined by:

$E = \emptyset$, $E = \varepsilon$, $E = a$,
$E = (F + G)$, $E = (F \cdot G)$, $E = (F^*)$,
$E = \mathbb{F}_k(F)$

where a is any symbol in $\Sigma$, F and G are any two AREs, $\mathbb{F}$ is any symbol-wise
word comparison function and k is any integer in $\mathbb{N} \cup \{\bot\}$.

Definition 4. The language denoted by an ARE E is the language L(E)
inductively defined by:

$L(\emptyset) = \emptyset$, $L(\varepsilon) = \{\varepsilon\}$, $L(a) = \{a\}$,
$L(F + G) = L(F) \cup L(G)$, $L(F \cdot G) = L(F) \cdot L(G)$, $L(F^*) = L(F)^*$,
$L(\mathbb{F}_k(F)) = \mathbb{F}_k(L(F))$,

where a is any symbol in $\Sigma$, F and G are any two AREs, $\mathbb{F}$ is any symbol-wise
word comparison function and k is any integer in $\mathbb{N} \cup \{\bot\}$.

In order to prove that the language denoted by an ARE E is regular, we will
show how to compute a finite automaton recognizing L(E).

4 Hamming and Levenshtein Derivation Formulae

In this section, we extend the derivation formulae to the family of approximate
regular expressions where the word comparison functions are the usual Hamming
and Levenshtein distances. Notice that the proofs are not given in this section,
but will be stated in Section 5.4, deduced from the proof of the general case
provided in Section 5.

Let a be a symbol in an alphabet $\Sigma$ and L be a regular language over $\Sigma$. Let
k be an integer and $L' = \mathbb{L}_k(L)$. The quotient of L' w.r.t. a is by definition the
set of words w such that there exists a word w' in L satisfying $\mathbb{L}(aw, w') \leq k$.
Consequently, we distinguish the four following cases, according to the way w'
can be split:

1. $w' = aw''$ and $\mathrm{L}(a, a) + \mathbb{L}(w, w'') \leq k$: hence the word w'' is by definition
in $a^{-1}(L)$ and $\mathbb{L}(w, w'') \leq k$. Consequently, $w \in \mathbb{L}_k(a^{-1}(L))$;

2. $w' = bw''$ with $b \in \Sigma \setminus \{a\}$ and $\mathrm{L}(a, b) + \mathbb{L}(w, w'') \leq k$: hence the word
w'' is by definition in $b^{-1}(L)$ and $\mathbb{L}(w, w'') \leq k - 1$. Consequently,
$w \in \mathbb{L}_{k-1}(b^{-1}(L))$;

3. $\mathrm{L}(a, \varepsilon) + \mathbb{L}(w, w') \leq k$: hence the word w' is by definition in L and
$\mathbb{L}(w, w') \leq k - 1$. Consequently, $w \in \mathbb{L}_{k-1}(L)$;

4. $w' = bw''$ with $b \in \Sigma$ and $\mathrm{L}(\varepsilon, b) + \mathbb{L}(aw, w'') \leq k$: hence the word
w'' is by definition in $b^{-1}(L)$ and $\mathbb{L}(aw, w'') \leq k - 1$. Consequently,
$w \in a^{-1}(\mathbb{L}_{k-1}(b^{-1}(L)))$.

2 The fact that any ARE denotes a regular language is proved in Corollary 1.

Notice that for the Hamming distance, only the two first cases need to be
considered since $\mathrm{H}(\alpha, \beta) = \bot$ whenever $\alpha = \varepsilon$ and $\beta \neq \varepsilon$ or $\alpha \neq \varepsilon$ and $\beta = \varepsilon$.
As a consequence, the following lemma can be stated.

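Lemma 1. Let L be a regular language over an alphabet $\Sigma$, a be a symbol in $\Sigma$
and k be an integer in $\mathbb{N} \cup \{\bot\}$. Then:

$a^{-1}(\mathbb{H}_k(L)) = \mathbb{H}_k(a^{-1}(L)) \cup \bigcup_{b \in \Sigma \setminus \{a\}} \mathbb{H}_{k-1}(b^{-1}(L))$,

$a^{-1}(\mathbb{L}_k(L)) = \mathbb{L}_k(a^{-1}(L)) \cup \bigcup_{b \in \Sigma \setminus \{a\}} \mathbb{L}_{k-1}(b^{-1}(L)) \cup \mathbb{L}_{k-1}(L) \cup a^{-1}\left(\bigcup_{b \in \Sigma} \mathbb{L}_{k-1}(b^{-1}(L))\right)$.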
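
The Hamming part of Lemma 1 can be checked by brute force on a finite
language, for which both sides are finite sets. The following Python sketch is
only an illustration: the function names are hypothetical and a negative index
is used to encode the undefined value ⊥.

```python
from itertools import product

def hamming(u, v):
    """Hamming word distance; None (undefined) when the lengths differ."""
    return sum(x != y for x, y in zip(u, v)) if len(u) == len(v) else None

def H(k, lang, alphabet):
    """H_k(lang) for a finite language; k < 0 encodes the undefined index."""
    if k < 0:
        return set()
    out = set()
    for u in lang:
        # only words of the same length as u can be at a defined distance
        for p in product(alphabet, repeat=len(u)):
            w = "".join(p)
            if hamming(w, u) <= k:
                out.add(w)
    return out

def quotient(a, lang):
    """a^{-1}(lang) for a finite language."""
    return {w[1:] for w in lang if w[:1] == a}

def check_lemma1_hamming(lang, alphabet, k):
    """Compare both sides of the Hamming formula for every symbol a."""
    for a in alphabet:
        lhs = quotient(a, H(k, lang, alphabet))
        rhs = H(k, quotient(a, lang), alphabet)
        for b in alphabet:
            if b != a:
                rhs |= H(k - 1, quotient(b, lang), alphabet)
        if lhs != rhs:
            return False
    return True
```

For example, `check_lemma1_hamming({"ab", "ba", "abc"}, "abc", 1)` compares
both sides of the formula for each of the three symbols.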



In the remainder of this section, we consider restricted AREs that only use
Hamming and Levenshtein distances.

Definition 5. Let $\Sigma$ be an alphabet. A Hamming-Levenshtein Approximate
Regular Expression (HLARE) over $\Sigma$ is an ARE over $\Sigma$ satisfying the following
condition:

For any subexpression $G = \mathbb{F}_k(H)$, either $\mathbb{F} = \mathbb{H}$ or $\mathbb{F} = \mathbb{L}$.



4.1 Brzozowski Derivatives for an HLARE 

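In this subsection, we extend the Brzozowski derivation to the HLAREs. From
an HLARE E and a word w, Brzozowski derivation allows us to syntactically
compute an HLARE $D'_w(E)$, called the dissimilar derivative of E w.r.t. w,
denoting the language $w^{-1}(L(E))$.

Definition 6. Let E be an HLARE over an alphabet $\Sigma$. Let a and b be two
distinct symbols in $\Sigma$ and w be a word in $\Sigma^*$. The dissimilar derivative of E
w.r.t. the symbol a (resp. the word w) is the HLARE $D'_a(E)$ (resp. $D'_w(E)$)
defined as follows:

$D'_a(\varepsilon) = D'_a(\emptyset) = D'_a(b) = \emptyset$, $D'_a(a) = \varepsilon$,

$D'_a(E_1 + E_2) = (D'_a(E_1) + D'_a(E_2))_{\sim_s}$, $D'_a(E_1^*) = D'_a(E_1) \cdot E_1^*$,

$$D'_a(E_1 \cdot E_2) = \begin{cases} (D'_a(E_1) \cdot E_2 + D'_a(E_2))_{\sim_s} & \text{if } r(\varepsilon, E_1) = 1, \\ (D'_a(E_1) \cdot E_2)_{\sim_s} & \text{if } r(\varepsilon, E_1) = 0, \end{cases}$$

$D'_a(\mathbb{H}_k(E_1)) = \left(\mathbb{H}_k(D'_a(E_1)) + \sum_{b \in \Sigma \setminus \{a\}} \mathbb{H}_{k-1}(D'_b(E_1))\right)_{\sim_s}$,

$D'_a(\mathbb{L}_k(E_1)) = \left(\mathbb{L}_k(D'_a(E_1)) + \sum_{b \in \Sigma \setminus \{a\}} \mathbb{L}_{k-1}(D'_b(E_1)) + \mathbb{L}_{k-1}(E_1) + D'_a\left(\sum_{b \in \Sigma} \mathbb{L}_{k-1}(D'_b(E_1))\right)\right)_{\sim_s}$,

$$D'_w(E) = \begin{cases} E & \text{if } w = \varepsilon, \\ D'_u(D'_a(E)) & \text{if } w = au \land a \in \Sigma \land u \in \Sigma^*, \end{cases}$$

where $E_1$ and $E_2$ are any two HLAREs and k is any integer in $\mathbb{N} \cup \{\bot\}$.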

Lemma 2. Let E be an HLARE over an alphabet $\Sigma$. Let w be a word in $\Sigma^*$.
Then:

$L(D'_w(E)) = w^{-1}(L(E))$.

Next lemma shows that the boolean $r(\varepsilon, E)$ is syntactically computable for
any HLARE E using dissimilar derivatives.

Lemma 3. Let $E = \mathbb{H}_k(E')$ and $F = \mathbb{L}_k(F')$ be two HLAREs over an alphabet
$\Sigma$. Then the two following propositions are satisfied:

• $\varepsilon \in L(E) \Leftrightarrow \varepsilon \in L(E')$,

• $\varepsilon \in L(F) \Leftrightarrow \varepsilon \in L(F') \cup \bigcup_{a \in \Sigma} L(\mathbb{L}_{k-1}(D'_a(F')))$.

Given an HLARE E, we denote by $\mathcal{D}_{HL}(E)$ the set $\{D'_w(E) \mid w \in \Sigma^*\}$ of
the dissimilar derivatives of E.

Lemma 4. The set $\mathcal{D}_{HL}(E)$ of dissimilar derivatives of an HLARE E is finite.

From this finite set, one can compute a deterministic finite automaton that
recognizes L(E).

Definition 7. Let E be an HLARE over an alphabet $\Sigma$. The tuple
$B'(E) = (\Sigma, Q, I, F, \delta)$ is defined by:

• $Q = \mathcal{D}_{HL}(E)$,

• $I = \{E\}$,

• $F = \{q \in Q \mid r(\varepsilon, q) = 1\}$,

• $\forall (q, a) \in Q \times \Sigma$, $\delta(q, a) = \{D'_a(q)\}$.

Proposition 1. Let E be an HLARE over an alphabet $\Sigma$. Then:

B'(E) is a deterministic finite automaton that recognizes L(E).

For any HLARE E, the automaton B'(E) is called the dissimilar derivative
finite automaton of E.

Example 3 presents the computation of the dissimilar derivative automaton
of an HLARE. Example 4 illustrates the computation of the boolean r(w, E) for
an HLARE E. Notice that in both of these examples, the following reductions
are used:

$E + \emptyset = \emptyset + E = E$,
$E \cdot \emptyset = \emptyset \cdot E = \emptyset$,
$E \cdot \varepsilon = \varepsilon \cdot E = E$,
$\mathbb{F}_\bot(E) = \emptyset$.

Example 3. Let $F = b^*(a + b)c^*$ and $E = \mathbb{H}_1(F)$ be an HLARE over $\Sigma = \{a, b, c\}$.
The dissimilar derivatives of E are the following expressions:



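$D'_a(E) = \mathbb{H}_0(F) + \mathbb{H}_1(c^*) + \mathbb{H}_0(c^*) = E_1$
$D'_b(E) = E + \mathbb{H}_1(c^*) + \mathbb{H}_0(c^*) = E_2$
$D'_c(E) = \mathbb{H}_0(F) + \mathbb{H}_0(c^*) = E_3$

$D'_a(E_1) = \mathbb{H}_0(c^*) = E_4$
$D'_b(E_1) = \mathbb{H}_0(F) + \mathbb{H}_0(c^*) = E_3$
$D'_c(E_1) = \mathbb{H}_1(c^*) + \mathbb{H}_0(c^*) = E_5$

$D'_a(E_2) = \mathbb{H}_0(F) + \mathbb{H}_1(c^*) + \mathbb{H}_0(c^*) = E_1$
$D'_b(E_2) = E + \mathbb{H}_1(c^*) + \mathbb{H}_0(c^*) = E_2$
$D'_c(E_2) = \mathbb{H}_0(F) + \mathbb{H}_0(c^*) + \mathbb{H}_1(c^*) = E_1$

$D'_a(E_3) = \mathbb{H}_0(c^*) = E_4$
$D'_b(E_3) = \mathbb{H}_0(F) + \mathbb{H}_0(c^*) = E_3$
$D'_c(E_3) = \mathbb{H}_0(c^*) = E_4$

$D'_a(E_4) = \emptyset$
$D'_b(E_4) = \emptyset$
$D'_c(E_4) = \mathbb{H}_0(c^*) = E_4$

$D'_a(E_5) = \mathbb{H}_0(c^*) = E_4$
$D'_b(E_5) = \mathbb{H}_0(c^*) = E_4$
$D'_c(E_5) = \mathbb{H}_1(c^*) + \mathbb{H}_0(c^*) = E_5$

The dissimilar derivative automaton of E is given in Figure 3.

Figure 3: The dissimilar derivative automaton of $E = \mathbb{H}_1(b^*(a + b)c^*)$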

Example 4. Let $G = \mathbb{L}_1((aba + abb)a(a)^*)$ be an HLARE over the alphabet
$\Sigma = \{a, b\}$ and w = aba be a word in $\Sigma^*$. Then: $r(aba, G) = r(\varepsilon, D'_{aba}(G))$. Let
us first compute the HLARE $D'_{aba}(G)$:

$D'_a(G) = \mathbb{L}_1((ba + bb)a(a)^*) + \mathbb{L}_0((aba + abb)a(a)^*) = G_1$
$D'_b(G_1) = \mathbb{L}_1((a + b)a(a)^*) + \mathbb{L}_0((ba + bb)a(a)^*) + \mathbb{L}_0(a(a)^*) = G_2$
$D'_a(G_2) = \mathbb{L}_1(a(a)^*) + \mathbb{L}_0(a(a)^*) + \mathbb{L}_0((a + b)a(a)^*) + \mathbb{L}_0((a)^*) = G_3$

Hence $r(aba, G) = r(\varepsilon, G_3)$. Furthermore, since $\varepsilon \in L(\mathbb{L}_0(D'_a(a(a)^*)))$, it
holds that $\varepsilon \in L(\mathbb{L}_1(a(a)^*))$. Consequently, $r(\varepsilon, G_3) = 1$ and aba belongs to L(G).
Notice that in this case:

1. The word w is split up into $s_w = (a, b, a, \varepsilon)$;

2. The word w' = abaa in $L((aba + abb)a(a)^*)$ can be split up into
$s_{w'} = (a, b, a, a)$;

3. It holds $\mathcal{L}(s_w, s_{w'}) = 1$.

Another split-up is presented in Example 6.
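
The formulae of Definition 6 and Lemma 3 can be turned into a small membership
test. The Python sketch below is only an illustration under stated
assumptions: expressions are nested tuples, all names are hypothetical, a
negative index encodes ⊥, and only the reductions listed before Example 3
(plus E + E = E and the language-preserving reduction of a similarity operator
applied to ∅) are performed instead of a full ACI normalization.

```python
EMPTY, EPS = ("empty",), ("eps",)

def plus(f, g):
    """Sum with the reductions E + 0 = 0 + E = E and E + E = E."""
    if f == EMPTY:
        return g
    if g == EMPTY or f == g:
        return f
    return ("sum", f, g)

def cat(f, g):
    """Concatenation with the reductions on the empty set and on eps."""
    if EMPTY in (f, g):
        return EMPTY
    if f == EPS:
        return g
    if g == EPS:
        return f
    return ("cat", f, g)

def sim(op, k, f):
    """Similarity operator, op in {"H", "L"}; k < 0 encodes the undefined index."""
    return EMPTY if k < 0 or f == EMPTY else (op, k, f)

def nullable(e, sigma):
    """r(eps, E), using Lemma 3 for the similarity operators."""
    tag = e[0]
    if tag in ("empty", "sym"):
        return False
    if tag in ("eps", "star"):
        return True
    if tag == "sum":
        return nullable(e[1], sigma) or nullable(e[2], sigma)
    if tag == "cat":
        return nullable(e[1], sigma) and nullable(e[2], sigma)
    op, k, f = e
    if k < 0:
        return False
    if nullable(f, sigma):
        return True
    if op == "H":
        return False
    return any(nullable(sim("L", k - 1, deriv(a, f, sigma)), sigma) for a in sigma)

def deriv(a, e, sigma):
    """Derivative of the HLARE e w.r.t. the symbol a (Definition 6)."""
    tag = e[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "sym":
        return EPS if e[1] == a else EMPTY
    if tag == "sum":
        return plus(deriv(a, e[1], sigma), deriv(a, e[2], sigma))
    if tag == "star":
        return cat(deriv(a, e[1], sigma), e)
    if tag == "cat":
        left = cat(deriv(a, e[1], sigma), e[2])
        return plus(left, deriv(a, e[2], sigma)) if nullable(e[1], sigma) else left
    op, k, f = e
    out = sim(op, k, deriv(a, f, sigma))
    for b in sigma:
        if b != a:
            out = plus(out, sim(op, k - 1, deriv(b, f, sigma)))
    if op == "L":
        out = plus(out, sim("L", k - 1, f))
        inner = EMPTY
        for b in sigma:
            inner = plus(inner, sim("L", k - 1, deriv(b, f, sigma)))
        out = plus(out, deriv(a, inner, sigma))
    return out

def member(w, e, sigma):
    """r(w, E): derive symbol by symbol, then test the empty word."""
    for a in w:
        e = deriv(a, e, sigma)
    return nullable(e, sigma)

def word(s):
    """Expression denoting the single word s."""
    e = EPS
    for ch in reversed(s):
        e = cat(("sym", ch), e)
    return e

A = ("sym", "a")
G = ("L", 1, cat(plus(word("aba"), word("abb")), cat(A, ("star", A))))
```

With this encoding, `member("aba", G, "ab")` holds, in accordance with
Example 4, whereas `member("ab", G, "ab")` does not, since two insertions would
be needed.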

4.2 Antimirov Partial Derivatives of an HLARE 

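In this subsection, we extend the Antimirov derivation to the HLAREs. From
an HLARE E and a word w, Antimirov derivation allows us to compute a set
$\Delta_w(E)$ of HLAREs, called the partial derivative of E w.r.t. w. Any HLARE
in $\Delta_w(E)$ is called a derivated term of E w.r.t. w. Finally, we state that the
union of the languages denoted by the derivated terms in $\Delta_w(E)$ is equal to
$w^{-1}(L(E))$.

Definition 8. Let E be an HLARE over an alphabet $\Sigma$. Let a and b be two
distinct symbols in $\Sigma$ and w be a word in $\Sigma^*$. The partial derivative of E w.r.t.
the symbol a (resp. to the word w) is the set $\Delta_a(E)$ (resp. $\Delta_w(E)$) of HLAREs
defined as follows:

$\Delta_a(\varepsilon) = \Delta_a(\emptyset) = \Delta_a(b) = \emptyset$, $\Delta_a(a) = \{\varepsilon\}$,

$\Delta_a(E_1 + E_2) = \Delta_a(E_1) \cup \Delta_a(E_2)$, $\Delta_a(E_1^*) = \Delta_a(E_1) \cdot E_1^*$,

$$\Delta_a(E_1 \cdot E_2) = \begin{cases} \Delta_a(E_1) \cdot E_2 \cup \Delta_a(E_2) & \text{if } r(\varepsilon, E_1) = 1, \\ \Delta_a(E_1) \cdot E_2 & \text{if } r(\varepsilon, E_1) = 0, \end{cases}$$

$\Delta_a(\mathbb{H}_k(E_1)) = \mathbb{H}_k(\Delta_a(E_1)) \cup \bigcup_{b \in \Sigma \setminus \{a\}} \mathbb{H}_{k-1}(\Delta_b(E_1))$,

$\Delta_a(\mathbb{L}_k(E_1)) = \mathbb{L}_k(\Delta_a(E_1)) \cup \bigcup_{b \in \Sigma \setminus \{a\}} \mathbb{L}_{k-1}(\Delta_b(E_1)) \cup \{\mathbb{L}_{k-1}(E_1)\} \cup \Delta_a\left(\bigcup_{b \in \Sigma} \mathbb{L}_{k-1}(\Delta_b(E_1))\right)$,

$$\Delta_w(E) = \begin{cases} \{E\} & \text{if } w = \varepsilon, \\ \Delta_{w'}(\Delta_a(E)) & \text{if } w = aw' \land a \in \Sigma \land w' \in \Sigma^*, \end{cases}$$

where $E_1$ and $E_2$ are any two HLAREs and k is an integer in $\mathbb{N} \cup \{\bot\}$, and
where for any set $\mathcal{E}$ of HLAREs, for any HLARE F, for any symbol a in $\Sigma$,

$\mathcal{E} \cdot F = \bigcup_{E \in \mathcal{E}} \{E \cdot F\}$,
$\Delta_a(\mathcal{E}) = \bigcup_{E \in \mathcal{E}} \Delta_a(E)$,
$\mathbb{H}_k(\mathcal{E}) = \bigcup_{E \in \mathcal{E}} \{\mathbb{H}_k(E)\}$,
$\mathbb{L}_k(\mathcal{E}) = \bigcup_{E \in \mathcal{E}} \{\mathbb{L}_k(E)\}$.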

Lemma 5. Let E be an HLARE over an alphabet $\Sigma$. Let w be a word in $\Sigma^*$.
Then:

$\bigcup_{G \in \Delta_w(E)} L(G) = w^{-1}(L(E))$.

Next lemma shows that the boolean $r(\varepsilon, E)$ is syntactically computable for
any HLARE E using partial derivation.

Lemma 6. Let $E = \mathbb{H}_k(E')$ and $F = \mathbb{L}_k(F')$ be two HLAREs over an alphabet
$\Sigma$. Then the two following conditions are satisfied:

• $\varepsilon \in L(E) \Leftrightarrow \varepsilon \in L(E')$,

• $\varepsilon \in L(F) \Leftrightarrow \varepsilon \in L(F') \cup \bigcup_{a \in \Sigma, G \in \Delta_a(F')} L(\mathbb{L}_{k-1}(G))$.

Given an HLARE E, we denote by $\mathcal{DT}_{HL}(E)$ the set $\bigcup_{w \in \Sigma^*} \Delta_w(E)$ of the
derivated terms of E.

Lemma 7. The set $\mathcal{DT}_{HL}(E)$ of the derivated terms of an HLARE E is finite.

From this finite set, one can compute a finite automaton that recognizes
L(E).

Definition 9. Let E be an HLARE over an alphabet $\Sigma$. The tuple
$A(E) = (\Sigma, Q, I, F, \delta)$ is defined by:

• $Q = \mathcal{DT}_{HL}(E)$,

• $I = \{E\}$,

• $F = \{q \in Q \mid r(\varepsilon, q) = 1\}$,

• $\forall (q, a) \in Q \times \Sigma$, $\delta(q, a) = \Delta_a(q)$.

Proposition 2. Let E be an HLARE over an alphabet $\Sigma$. Then:
A(E) is a finite automaton that recognizes L(E).

For any HLARE E, the automaton A(E) is the derivated term finite automaton
of E.

Example 5 presents the computation of the derivated term automaton of an
HLARE. Example 6 illustrates the computation of the boolean r(w, E) for an
HLARE E. Notice that in both of these examples, the five following reductions
are used:

$E + \emptyset = \emptyset + E = E$,
$E \cdot \emptyset = \emptyset \cdot E = \emptyset$,
$E \cdot \varepsilon = \varepsilon \cdot E = E$,
$\mathbb{F}_\bot(E) = \emptyset$,
$\{\emptyset\} = \emptyset$.

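Example 5. Let E be the HLARE defined in Example 3. The partial derivatives
of E are the following sets of expressions:

$\Delta_a(E) = \{\mathbb{H}_0(F), \mathbb{H}_1(c^*), \mathbb{H}_0(c^*)\}$
$\Delta_b(E) = \{E, \mathbb{H}_1(c^*), \mathbb{H}_0(c^*)\}$
$\Delta_c(E) = \{\mathbb{H}_0(F), \mathbb{H}_0(c^*)\}$

$\Delta_a(\mathbb{H}_1(c^*)) = \{\mathbb{H}_0(c^*)\}$
$\Delta_b(\mathbb{H}_1(c^*)) = \{\mathbb{H}_0(c^*)\}$
$\Delta_c(\mathbb{H}_1(c^*)) = \{\mathbb{H}_1(c^*)\}$

$\Delta_a(\mathbb{H}_0(F)) = \{\mathbb{H}_0(c^*)\}$
$\Delta_b(\mathbb{H}_0(F)) = \{\mathbb{H}_0(F), \mathbb{H}_0(c^*)\}$
$\Delta_c(\mathbb{H}_0(F)) = \emptyset$

$\Delta_a(\mathbb{H}_0(c^*)) = \emptyset$
$\Delta_b(\mathbb{H}_0(c^*)) = \emptyset$
$\Delta_c(\mathbb{H}_0(c^*)) = \{\mathbb{H}_0(c^*)\}$

The derivated term automaton of E is given in Figure 4.

Figure 4: The derivated term automaton of $E = \mathbb{H}_1(b^*(a + b)c^*)$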

3 The four first equalities are HLAREs reductions whereas the last one is a HLARE set 
reduction. 
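Example 6. Let $G = \mathbb{L}_1((aba + abb)a(a)^*)$ be the HLARE defined in an earlier example
and $w = aba$ be a word in $\Sigma^*$. Then $\mathrm{r}(aba, G) = \bigvee_{H \in \Delta_{aba}(G)} \mathrm{r}(\varepsilon, H)$. Let us
first compute the HLARE set $\Delta_{aba}(G)$:

\[
\begin{aligned}
\Delta_a(G) &= \{\mathbb{L}_1(baa(a)^*), \mathbb{L}_1(bba(a)^*), \mathbb{L}_0((aba + abb)a(a)^*)\} = \mathcal{G}_1 \\
\Delta_b(\mathcal{G}_1) &= \{\mathbb{L}_1(aa(a)^*), \mathbb{L}_0(baa(a)^*), \mathbb{L}_1(ba(a)^*), \mathbb{L}_0(bba(a)^*), \mathbb{L}_0(a(a)^*)\} = \mathcal{G}_2 \\
\Delta_a(\mathcal{G}_2) &= \{\mathbb{L}_1(a(a)^*), \mathbb{L}_0(aa(a)^*), \mathbb{L}_0((a)^*), \mathbb{L}_0(ba(a)^*), \mathbb{L}_0(a(a)^*)\} = \mathcal{G}_3
\end{aligned}
\]

Hence $\mathrm{r}(aba, G) = \bigvee_{H \in \mathcal{G}_3} \mathrm{r}(\varepsilon, H)$. Furthermore, since $\varepsilon \in L(\mathbb{L}_0((a)^*))$, it holds
that $\mathrm{r}(\varepsilon, \mathcal{G}_3) = 1$. Finally, $aba$ belongs to $L(G)$.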
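The three sets $\mathcal{G}_1$, $\mathcal{G}_2$ and $\mathcal{G}_3$ can be recomputed with a short Python sketch. It is given as an illustration only: the tuple encoding and all function names are hypothetical, the Levenshtein case of `pd` assumes the four-term rule that appears in the proof of Proposition 5 (match, substitution, pair $(a, \varepsilon)$, pair $(\varepsilon, b)$ followed by a derivation w.r.t. $a$), and `nullable` assumes the second item of Lemma 6.

```python
# Illustrative encoding: ("eps",), ("sym", a), ("cat", e1, e2), ("alt", e1, e2),
# ("star", e), ("lev", k, e) for the Levenshtein operator L_k.
EPS = ("eps",)

def cat(e1, e2):
    """Concatenation with the reduction E . eps = eps . E = E."""
    if e1 == EPS:
        return e2
    if e2 == EPS:
        return e1
    return ("cat", e1, e2)

def word(w):
    """Expression denoting the single word w."""
    e = EPS
    for a in reversed(w):
        e = cat(("sym", a), e)
    return e

def nullable(e, sigma):
    """r(eps, e): does the empty word belong to L(e)?"""
    tag = e[0]
    if tag in ("eps", "star"):
        return True
    if tag == "sym":
        return False
    if tag == "cat":
        return nullable(e[1], sigma) and nullable(e[2], sigma)
    if tag == "alt":
        return nullable(e[1], sigma) or nullable(e[2], sigma)
    _, k, inner = e  # "lev": second item of Lemma 6
    if nullable(inner, sigma):
        return True
    return k > 0 and any(nullable(("lev", k - 1, t), sigma)
                         for b in sigma for t in pd(inner, b, sigma))

def pd(e, a, sigma):
    """Partial derivative of e w.r.t. the symbol a."""
    tag = e[0]
    if tag == "eps":
        return frozenset()
    if tag == "sym":
        return frozenset({EPS}) if e[1] == a else frozenset()
    if tag == "alt":
        return pd(e[1], a, sigma) | pd(e[2], a, sigma)
    if tag == "cat":
        out = frozenset(cat(t, e[2]) for t in pd(e[1], a, sigma))
        return out | pd(e[2], a, sigma) if nullable(e[1], sigma) else out
    if tag == "star":
        return frozenset(cat(t, e) for t in pd(e[1], a, sigma))
    _, k, inner = e  # "lev": assumed Levenshtein rule
    out = {("lev", k, t) for t in pd(inner, a, sigma)}      # pair (a, a), cost 0
    if k > 0:
        out.add(("lev", k - 1, inner))                       # pair (a, eps), cost 1
        for b in sigma:
            for t in pd(inner, b, sigma):
                if b != a:
                    out.add(("lev", k - 1, t))               # pair (a, b), cost 1
                out |= pd(("lev", k - 1, t), a, sigma)       # pair (eps, b), cost 1
    return frozenset(out)

def derive_word(e, w, sigma):
    """Set of derivated terms reached from e after reading w."""
    current = {e}
    for a in w:
        current = set().union(*(pd(q, a, sigma) for q in current))
    return current

def member(e, w, sigma):
    return any(nullable(q, sigma) for q in derive_word(e, w, sigma))

SIGMA = "ab"
A = ("sym", "a")
G = ("lev", 1, cat(("alt", word("aba"), word("abb")), cat(A, ("star", A))))
```

With these definitions, `derive_word(G, "a", SIGMA)`, `derive_word(G, "ab", SIGMA)` and `derive_word(G, "aba", SIGMA)` play the roles of $\mathcal{G}_1$, $\mathcal{G}_2$ and $\mathcal{G}_3$, and `member(G, "aba", SIGMA)` corresponds to $\mathrm{r}(aba, G)$.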
Notice that in this case: 

1. The word $w$ is split up into $s_w = (a, b, \varepsilon, a)$;

2. The word $w' = abaa$ in $L((aba + abb)a(a)^*)$ can be split up into $s_{w'} =
(a, b, a, a)$;

3. It holds $\mathcal{L}(s_w, s_{w'}) = 1$.

Another split-up is presented in Example^ 
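The cost of the alignment used in Example 6 can be checked with a few lines of Python. This is a minimal sketch under two assumptions: the cost of a pair of sequences is the sum of the symbol-wise costs (which is how symbol-wise functions are used in Section 5.1), and the Levenshtein symbol-wise cost is 0 for two equal entries and 1 otherwise. The empty word is encoded by the empty string, and the function names are hypothetical.

```python
EPS = ""  # encoding of the empty word inside a split-up

def lev_cost(x, y):
    """Assumed symbol-wise Levenshtein cost of the pair (x, y)."""
    return 0 if x == y else 1

def alignment_cost(s, t):
    """Cost of an alignment (s, t): two sequences of the same size."""
    if len(s) != len(t):
        raise ValueError("an alignment needs two sequences of the same size")
    return sum(lev_cost(x, y) for x, y in zip(s, t))

def levenshtein(u, v):
    """Classical dynamic-programming edit distance, used as a cross-check."""
    prev = list(range(len(v) + 1))
    for i, x in enumerate(u, 1):
        cur = [i]
        for j, y in enumerate(v, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

s_w = ("a", "b", EPS, "a")        # split-up of w = aba
s_w_prime = ("a", "b", "a", "a")  # split-up of w' = abaa
```

Here `alignment_cost(s_w, s_w_prime)` and `levenshtein("aba", "abaa")` agree, which matches item 3 above.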

5 Word Comparison Functions, Quotients and Derivatives

In this section, we address the general case. We present two constructions of an
automaton from an ARE, using Brzozowski's derivatives and Antimirov's derivatives,
which lead respectively to a deterministic automaton and a non-deterministic one.
We first show how to compute the quotient of a given language $\mathbb{F}_k(L)$ w.r.t. a
symbol $a$, where $\mathbb{F}$ is a given word comparison function, $k$ is an integer and $L$
is a regular language.

5.1 Quotient of a Language 

Let $\mathbb{F}$ be a word comparison function associated with a symbol-wise sequence
comparison function $\mathcal{F}$ defined over an alphabet $\Sigma$. Let $k$ be a positive integer,
$a$ be a symbol in $\Sigma$, $u = aw$ be a word of $\Sigma^+$, and $L'$ be a regular language
of $\Sigma^*$. By definition of the operator $\mathbb{F}_k$, the word $u$ is in $L = \mathbb{F}_k(L')$ if and only if
there exists a word $v \in L'$ such that $\mathbb{F}(u, v) \leq k$. By definition of $\mathbb{F}$, this
is equivalent to the existence of an alignment (see footnote 4) $(u', v') \in \mathrm{Split}_n(u) \times \mathrm{Split}_n(v)$,
where $n$ is a positive integer, between $u$ and $v$, the cost $\mathcal{F}(u', v')$ of which is
not greater than $k$. Let $u' = (u'_1, \ldots, u'_n)$ and $v' = (v'_1, \ldots, v'_n)$.

(a) If $n = 1$, $\mathbb{F}(u, v) = \mathcal{F}((a), (v'_1))$ and since $u = aw$, $u \in L \Leftrightarrow w \in \mathbb{F}_{k - \mathcal{F}((a),(v'_1))}(v'^{-1}_1(L'))$.

(b) Otherwise, let us set $u'' = (u'_2, \ldots, u'_n)$ and $v'' = (v'_2, \ldots, v'_n)$. Moreover, let
us set $t = u$ if $u'_1 = \varepsilon$ and $t = u'_2 \cdots u'_n$ otherwise; let us similarly set $z = v$ if
$v'_1 = \varepsilon$ and $z = v'_2 \cdots v'_n$ otherwise. Obviously, the word $z$ belongs to $v'^{-1}_1(L')$.
Since $\mathbb{F}$ is a symbol-wise word comparison function, there exists an alignment
$(u', v')$ between $u$ and $v$ satisfying $\mathcal{F}(u', v') \leq k$ if and only if there exists an
alignment $(u'', v'')$ between $t$ and $z$ satisfying $\mathcal{F}(u'', v'') \leq k - \mathcal{F}((u'_1), (v'_1))$.
By definition of $\mathbb{F}$, this is equivalent to the existence of a word $z \in v'^{-1}_1(L')$
such that $\mathbb{F}(t, z) \leq k - \mathcal{F}((u'_1), (v'_1))$. By definition of the operator $\mathbb{F}_k$, it is equivalent
to say that the word $t$ is in $\mathbb{F}_{k - \mathcal{F}((u'_1),(v'_1))}(v'^{-1}_1(L'))$. Depending on the value
of $(u'_1, v'_1)$, we can distinguish the following cases:

Case 1 $(u'_1, v'_1) = (a, b)$, with $b \in \Sigma$: $u = aw \in L \Leftrightarrow w \in \mathbb{F}_{k - \mathcal{F}((a),(b))}(b^{-1}(L'))$;
Case 2 $(u'_1, v'_1) = (a, \varepsilon)$: $u = aw \in L \Leftrightarrow w \in \mathbb{F}_{k - \mathcal{F}((a),(\varepsilon))}(L')$;
Case 3 $(u'_1, v'_1) = (\varepsilon, b)$, with $b \in \Sigma$: $u = aw \in L \Leftrightarrow w \in a^{-1}(\mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}(b^{-1}(L')))$.

Since $w \in a^{-1}(\mathbb{F}_k(L')) \Leftrightarrow aw \in \mathbb{F}_k(L')$, the three previous cases provide a recursive
expression of the quotient of the language $\mathbb{F}_k(L')$ w.r.t. a symbol $a \in \Sigma$.
Unfortunately, its computation may imply a recursive loop, due to Case 3, when
$\mathcal{F}((\varepsilon), (b)) = 0$. It is possible to get rid of this loop by precomputing the set of
all the quotients of $L'$ w.r.t. words $w$ such that $\mathbb{F}(\varepsilon, w) = 0$. For this purpose,
let us set $W_{\mathcal{F}} = (\bigcup_{b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) = 0} \{b\})^*$ and $X(L') = \{L'\} \cup \bigcup_{w \in W_{\mathcal{F}}} \{w^{-1}(L')\}$.
Let us notice that if $L'$ is a regular language, the set of its residuals is finite; as
a consequence, so is $X(L')$.

4 An alignment between two words $u$ and $v$ is a pair $(s, s')$ of sequences of the same size such
that $s \in \mathrm{Split}(u)$ and $s' \in \mathrm{Split}(v)$.
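When $L'$ is given by a deterministic automaton, each residual $w^{-1}(L')$ corresponds to the state reached after reading $w$, so the set $X(L')$ can be precomputed by a reachability search restricted to the symbols $b$ such that $\mathcal{F}((\varepsilon), (b)) = 0$. The following Python sketch illustrates this idea; the DFA encoding (a dictionary from `(state, symbol)` to state) and the function name are assumptions made here for the example.

```python
def zero_cost_closure(delta, state, zero_symbols):
    """States reachable from `state` by the words of W_F = zero_symbols*.

    delta        : dict mapping (state, symbol) -> state (partial DFA, hypothetical encoding)
    zero_symbols : symbols b whose pairing with the empty word costs 0
    Each returned state stands for one residual w^{-1}(L') with w in W_F.
    """
    seen, todo = {state}, [state]
    while todo:
        q = todo.pop()
        for b in zero_symbols:
            r = delta.get((q, b))
            if r is not None and r not in seen:
                seen.add(r)
                todo.append(r)
    return seen

# Toy DFA: x is a zero-cost symbol, y is not.
toy_delta = {(0, "x"): 1, (1, "x"): 2, (2, "y"): 0, (0, "y"): 3}
```

The search terminates because the automaton has finitely many states, which mirrors the finiteness argument given for $X(L')$. When no symbol has cost 0, as with the Hamming and Levenshtein functions, the closure reduces to the starting state, that is, $X(L') = \{L'\}$.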

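Lemma 8. Let $L = \mathbb{F}_k(L')$ be a language over an alphabet $\Sigma$ where $L'$ is a
regular language, $\mathbb{F}$ is a symbol-wise word comparison function associated with
a sequence comparison function $\mathcal{F}$ and $a$ be a symbol in $\Sigma$. The quotient of $L$
w.r.t. $a$ is the language $a^{-1}(L)$ computed as follows:

\[
\begin{aligned}
a^{-1}(L) ={} & \bigcup_{L'' \in X(L'),\, b \in \Sigma} \mathbb{F}_{k - \mathcal{F}((a),(b))}(b^{-1}(L'')) \\
& \cup \bigcup_{L'' \in X(L')} \mathbb{F}_{k - \mathcal{F}((a),(\varepsilon))}(L'') \\
& \cup a^{-1}\Big(\bigcup_{L'' \in X(L'),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}(b^{-1}(L''))\Big)
\end{aligned}
\]

where $X(L') = \{L'\} \cup \bigcup_{w \in W_{\mathcal{F}}} \{w^{-1}(L')\}$ with $W_{\mathcal{F}} = (\bigcup_{b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) = 0} \{b\})^*$.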
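As an illustration, let us specialize this formula to the Levenshtein case, assuming the costs recalled in the proof of Proposition 5: $\mathcal{L}((a),(b)) \in \{0, 1\}$ (0 when $a = b$) and $\mathcal{L}((a),(\varepsilon)) = \mathcal{L}((\varepsilon),(a)) = 1$. No symbol has cost 0 against $\varepsilon$, hence $W_{\mathcal{L}} = \{\varepsilon\}$ and $X(L') = \{L'\}$, and the quotient becomes:

\[
\begin{aligned}
a^{-1}(\mathbb{L}_k(L')) ={} & \mathbb{L}_k(a^{-1}(L')) \cup \bigcup_{b \in \Sigma \setminus \{a\}} \mathbb{L}_{k-1}(b^{-1}(L')) \\
& \cup \mathbb{L}_{k-1}(L') \\
& \cup a^{-1}\Big(\bigcup_{b \in \Sigma} \mathbb{L}_{k-1}(b^{-1}(L'))\Big)
\end{aligned}
\]

When $k = 0$, the index $k - 1$ is undefined and the corresponding terms are empty (reduction $\mathbb{F}_\bot(E) = \emptyset$), so that only the first term $\mathbb{L}_0(a^{-1}(L'))$ remains.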

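Proof. For any symbols $\alpha$, $\beta$ in $\Sigma \cup \{\varepsilon\}$, let us set $k_{\alpha,\beta} = k - \mathcal{F}((\alpha), (\beta))$.

\[
\begin{aligned}
u \in a^{-1}(L) &\Leftrightarrow au \in L \Leftrightarrow \exists w \in L',\ \mathbb{F}(au, w) \in \{0, \ldots, k\} \\
&\Leftrightarrow
\begin{cases}
\exists b \in \Sigma,\ \exists w_1 b w_2 \in L',\ \mathbb{F}(\varepsilon, w_1) = 0 \wedge \mathbb{F}(u, w_2) \leq k_{a,b} \\
\vee\ \exists w_1 w_2 \in L',\ \mathbb{F}(\varepsilon, w_1) = 0 \wedge \mathbb{F}(u, w_2) \leq k_{a,\varepsilon} \\
\vee\ \exists b \in \Sigma,\ \exists w_1 b w_2 \in L',\ \mathbb{F}(\varepsilon, w_1) = 0 \wedge \mathcal{F}((\varepsilon),(b)) \neq 0 \wedge \mathbb{F}(au, w_2) \leq k_{\varepsilon,b}
\end{cases} \\
&\Leftrightarrow
\begin{cases}
\exists b \in \Sigma,\ \exists w_1 \in W_{\mathcal{F}},\ \exists w_2 \in (w_1 b)^{-1}(L'),\ \mathbb{F}(u, w_2) \leq k_{a,b} \\
\vee\ \exists w_1 \in W_{\mathcal{F}},\ \exists w_2 \in (w_1)^{-1}(L'),\ \mathbb{F}(u, w_2) \leq k_{a,\varepsilon} \\
\vee\ \exists b \in \Sigma,\ \exists w_1 \in W_{\mathcal{F}},\ \exists w_2 \in (w_1 b)^{-1}(L'),\ \mathcal{F}((\varepsilon),(b)) \neq 0 \wedge \mathbb{F}(au, w_2) \leq k_{\varepsilon,b}
\end{cases} \\
&\Leftrightarrow
\begin{cases}
\exists b \in \Sigma,\ \exists w_2 \in b^{-1}(\bigcup_{L'' \in X(L')} L''),\ \mathbb{F}(u, w_2) \leq k_{a,b} \\
\vee\ \exists w_2 \in \bigcup_{L'' \in X(L')} L'',\ \mathbb{F}(u, w_2) \leq k_{a,\varepsilon} \\
\vee\ \exists b \in \Sigma,\ \exists w_2 \in b^{-1}(\bigcup_{L'' \in X(L')} L''),\ \mathcal{F}((\varepsilon),(b)) \neq 0 \wedge \mathbb{F}(au, w_2) \leq k_{\varepsilon,b}
\end{cases} \\
&\Leftrightarrow
\begin{cases}
\exists b \in \Sigma,\ u \in \mathbb{F}_{k_{a,b}}(\bigcup_{L'' \in X(L')} b^{-1}(L'')) \\
\vee\ u \in \bigcup_{L'' \in X(L')} \mathbb{F}_{k_{a,\varepsilon}}(L'') \\
\vee\ \exists b \in \Sigma,\ \mathcal{F}((\varepsilon),(b)) \neq 0 \wedge au \in \mathbb{F}_{k_{\varepsilon,b}}(\bigcup_{L'' \in X(L')} b^{-1}(L''))
\end{cases} \\
&\Leftrightarrow
\begin{cases}
u \in \bigcup_{L'' \in X(L'),\, b \in \Sigma} \mathbb{F}_{k - \mathcal{F}((a),(b))}(b^{-1}(L'')) \\
\vee\ u \in \bigcup_{L'' \in X(L')} \mathbb{F}_{k - \mathcal{F}((a),(\varepsilon))}(L'') \\
\vee\ u \in a^{-1}(\bigcup_{L'' \in X(L'),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}(b^{-1}(L'')))
\end{cases}
\end{aligned}
\]

□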
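The formula of Lemma 8 can be sanity-checked by brute force on a finite language. The Python sketch below does so for the Levenshtein case only, under the assumption that $\mathbb{L}(u, v)$ is the classical edit distance with unit costs (so that $W_{\mathcal{L}} = \{\varepsilon\}$ and $X(L') = \{L'\}$). The function names are hypothetical, and a negative bound encodes the undefined index $\bot$.

```python
from itertools import product

def levenshtein(u, v):
    """Classical edit distance with unit costs."""
    prev = list(range(len(v) + 1))
    for i, x in enumerate(u, 1):
        cur = [i]
        for j, y in enumerate(v, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def near(lang, k, u):
    """u in L_k(lang); a negative k encodes the undefined bound (empty language)."""
    return k >= 0 and any(levenshtein(u, v) <= k for v in lang)

def quotient(lang, b):
    """Residual b^{-1}(lang) of a finite language."""
    return {v[1:] for v in lang if v[:1] == b}

def lemma8_rhs(lang, k, a, sigma, w):
    """w in the right-hand side of Lemma 8, Levenshtein case."""
    pair_a_b = any(near(quotient(lang, b), k - (a != b), w) for b in sigma)
    pair_a_eps = near(lang, k - 1, w)
    pair_eps_b = any(near(quotient(lang, b), k - 1, a + w) for b in sigma)
    return pair_a_b or pair_a_eps or pair_eps_b

def check(lang, sigma, max_k, max_len):
    """Compare a^{-1}(L_k(lang)) with the formula on all words up to max_len."""
    for k in range(max_k + 1):
        for n in range(max_len + 1):
            for letters in product(sigma, repeat=n):
                w = "".join(letters)
                for a in sigma:
                    if near(lang, k, a + w) != lemma8_rhs(lang, k, a, sigma, w):
                        return False
    return True
```

For instance, `check({"aba", "abb", "b"}, "ab", 2, 4)` compares both sides on every word of length at most 4 for $k \in \{0, 1, 2\}$. Such a test does not replace the proof; it only guards against transcription errors in the formula.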

5.2 Brzozowski Derivatives for an ARE 

An extension of Brzozowski derivatives can be directly deduced from the
computation of the quotient presented in Lemma 8.

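Definition 10. Let $E = \mathbb{F}_k(E')$ be an ARE over an alphabet $\Sigma$ where $\mathbb{F}$ is
associated with $\mathcal{F}$ and $a$ be a symbol in $\Sigma$. The dissimilar derivative of $E$ w.r.t.
$a$ is the expression $\frac{d}{d_a}(E)$ defined by:

\[
\begin{aligned}
\frac{d}{d_a}(E) ={} & \sum_{F \in X(E'),\, b \in \Sigma} \mathbb{F}_{k - \mathcal{F}((a),(b))}\Big(\frac{d}{d_b}(F)\Big) \\
& + \sum_{F \in X(E')} \mathbb{F}_{k - \mathcal{F}((a),(\varepsilon))}(F) \\
& + \frac{d}{d_a}\Big(\sum_{F \in X(E'),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}\Big(\frac{d}{d_b}(F)\Big)\Big)
\end{aligned}
\]

where $X(E') = \{E'\} \cup \bigcup_{w \in W_{\mathcal{F}}} \{\frac{d}{d_w}(E')\}$ with $W_{\mathcal{F}} = (\bigcup_{b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) = 0} \{b\})^*$.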

Let us show that the set of dissimilar derivatives of any ARE $E$ is finite
(Lemma 9), that the dissimilar derivative of $E$ w.r.t. a word $w$ denotes the
quotient of $L(E)$ w.r.t. $w$ (Lemma 10), and how to determine whether the
empty word belongs to the language denoted by $E$ (Lemma 11).

Lemma 9. Let $E = \mathbb{F}_k(E')$ be an ARE over an alphabet $\Sigma$ and $\mathcal{D}_E$ be the set
of dissimilar derivatives of $E$. Then $\mathcal{D}_E$ is a finite set of AREs. Moreover, its
computation halts.

Proof. Consider that $\mathbb{F}$ is associated with $\mathcal{F}$. Let us show by induction over the
structure of $E'$ and by recurrence over $k$ that $\mathcal{D}_E$ is a finite set of AREs.

By induction, the set $\mathcal{D}_{E'}$ is a finite set of AREs. Consequently, since $X(E')$
is a subset of $\mathcal{D}_{E'}$, (Fact 1) $X(E')$ is a finite set of derivatives of $E'$.

In order to show that $\mathcal{D}_E$ is a finite set, let us show that any derivative $G$
of $E$ satisfies the property $P(E', k)$: $G$ is a finite sum of expressions of type
$\mathbb{F}_{k'}(G')$ with $k' \leq k$ and $G'$ a derivative of $E'$.

According to Fact 1, any subexpression $\mathbb{F}_{k - \mathcal{F}((a),(\varepsilon))}(F)$ with $F \in X(E')$
satisfies $P(E', k)$. Since $X(E')$ is a subset of $\mathcal{D}_{E'}$, $\frac{d}{d_b}(F)$ is a derivative of $E'$ for
any $b$ in $\Sigma$. Consequently, the expression $\sum_{F \in X(E'),\, b \in \Sigma} \mathbb{F}_{k - \mathcal{F}((a),(b))}(\frac{d}{d_b}(F))$
also satisfies $P(E', k)$. Finally, by recurrence hypothesis, for $k' < k$, any derivative
of an expression $\mathbb{F}_{k'}(G')$ satisfies $P(G', k')$. Consequently, any derivative of
$\mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}(\frac{d}{d_b}(F))$ satisfies $P(\frac{d}{d_b}(F), k - \mathcal{F}((\varepsilon), (b)))$ if $\mathcal{F}((\varepsilon), (b)) \neq 0$. Since
$F$ is a derivative of $E'$, so is $\frac{d}{d_b}(F)$, and since $k - \mathcal{F}((\varepsilon), (b)) < k$, any derivative
of $\mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}(\frac{d}{d_b}(F))$ satisfies $P(E', k)$. As a consequence, (Fact 2) any
derivative of $E$ w.r.t. a symbol $a$ satisfies $P(E', k)$.

Let us show now that if an expression $H$ satisfies $P(E', k)$, then any symbol
derivative of $H$ also satisfies $P(E', k)$. Since $H$ is a sum of expressions of type
$\mathbb{F}_{k'}(G')$ where $k' \leq k$ and $G'$ is a derivative of $E'$, any symbol derivative $H'$ of
$H$ is the sum of the derivatives of the expressions $H$ is the sum of. According
to Fact 2, any symbol derivative of an expression $\mathbb{F}_{k'}(G')$ satisfies $P(G', k')$.
Since $G'$ is a derivative of $E'$ and $k' \leq k$, any expression satisfying $P(G', k')$
also satisfies $P(E', k)$. As a consequence, any derivative of $E$ w.r.t. a word $w$
in $\Sigma^*$ satisfies $P(E', k)$.

As a conclusion, since any derivative of $E$ is a sum of expressions all belonging
to the finite set $\{\mathbb{F}_{k'}(G) \mid k' \leq k \wedge G \in \mathcal{D}_{E'}\}$, using the ACI-equivalence, $\mathcal{D}_E$
is a finite set of AREs. Moreover, by induction over $E'$ and by recurrence over
$k$, since any derivative of an expression $F$ in $X(E')$ belongs to the finite set
of derivatives of $E'$ the computation of which halts, and since $\mathcal{F}((\varepsilon), (b)) \neq 0$
implies that $k - \mathcal{F}((\varepsilon), (b)) < k$, the computation of $\mathcal{D}_E$ halts. □

Lemma 10. Let $E = \mathbb{F}_k(E')$ be an ARE over an alphabet $\Sigma$ and $a$ be a symbol
in $\Sigma$. Then $L(\frac{d}{d_a}(E)) = a^{-1}(L(E))$.

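Proof. By induction over the structure of $E$. According to Lemma 8,

\[
\begin{aligned}
a^{-1}(L(E)) ={} & \bigcup_{L'' \in X(L(E')),\, b \in \Sigma} \mathbb{F}_{k - \mathcal{F}((a),(b))}(b^{-1}(L'')) \\
& \cup \bigcup_{L'' \in X(L(E'))} \mathbb{F}_{k - \mathcal{F}((a),(\varepsilon))}(L'') \\
& \cup a^{-1}\Big(\bigcup_{L'' \in X(L(E')),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}(b^{-1}(L''))\Big)
\end{aligned}
\]

where $X(L(E')) = \{L(E')\} \cup \bigcup_{w \in W_{\mathcal{F}}} \{w^{-1}(L(E'))\}$ with $W_{\mathcal{F}} = (\bigcup_{b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) = 0} \{b\})^*$.

Let $X(E') = \{E'\} \cup \bigcup_{w \in W_{\mathcal{F}}} \{\frac{d}{d_w}(E')\}$. By induction over $E'$, for any word $w$
in $\Sigma^*$, $w^{-1}(L(E')) = L(\frac{d}{d_w}(E'))$. As a consequence, there exists a surjection $f$
from $X(E')$ to $X(L(E'))$ such that for any expression $G$ in $X(E')$, $f(G) = L(G)$
belongs to $X(L(E'))$. As a consequence:

\[
\begin{aligned}
a^{-1}(L(E)) ={} & \bigcup_{E'' \in X(E'),\, b \in \Sigma} \mathbb{F}_{k - \mathcal{F}((a),(b))}(b^{-1}(L(E''))) \\
& \cup \bigcup_{E'' \in X(E')} \mathbb{F}_{k - \mathcal{F}((a),(\varepsilon))}(L(E'')) \\
& \cup a^{-1}\Big(\bigcup_{E'' \in X(E'),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}(b^{-1}(L(E'')))\Big)
\end{aligned}
\]

By induction over $E'$, for any derivative $E''$ of $E'$, $b^{-1}(L(E'')) = L(\frac{d}{d_b}(E''))$.
Consequently:

\[
\begin{aligned}
a^{-1}(L(E)) ={} & \bigcup_{E'' \in X(E'),\, b \in \Sigma} \mathbb{F}_{k - \mathcal{F}((a),(b))}\Big(L\Big(\frac{d}{d_b}(E'')\Big)\Big)
\cup \bigcup_{E'' \in X(E')} \mathbb{F}_{k - \mathcal{F}((a),(\varepsilon))}(L(E'')) \\
& \cup a^{-1}\Big(\bigcup_{E'' \in X(E'),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}\Big(L\Big(\frac{d}{d_b}(E'')\Big)\Big)\Big) \\
={} & L\Big(\sum_{E'' \in X(E'),\, b \in \Sigma} \mathbb{F}_{k - \mathcal{F}((a),(b))}\Big(\frac{d}{d_b}(E'')\Big)\Big)
\cup L\Big(\sum_{E'' \in X(E')} \mathbb{F}_{k - \mathcal{F}((a),(\varepsilon))}(E'')\Big) \\
& \cup a^{-1}\Big(L\Big(\sum_{E'' \in X(E'),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}\Big(\frac{d}{d_b}(E'')\Big)\Big)\Big)
\end{aligned}
\]

Furthermore, by recurrence over $k$, for any $\mathcal{F}((\varepsilon), (b)) > 0$, it holds:

\[ a^{-1}\Big(L\Big(\mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}\Big(\frac{d}{d_b}(E'')\Big)\Big)\Big) = L\Big(\frac{d}{d_a}\Big(\mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}\Big(\frac{d}{d_b}(E'')\Big)\Big)\Big). \]

Finally,

\[
\begin{aligned}
a^{-1}(L(E)) ={} & L\Big(\sum_{E'' \in X(E'),\, b \in \Sigma} \mathbb{F}_{k - \mathcal{F}((a),(b))}\Big(\frac{d}{d_b}(E'')\Big)\Big)
\cup L\Big(\sum_{E'' \in X(E')} \mathbb{F}_{k - \mathcal{F}((a),(\varepsilon))}(E'')\Big) \\
& \cup L\Big(\frac{d}{d_a}\Big(\sum_{E'' \in X(E'),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}\Big(\frac{d}{d_b}(E'')\Big)\Big)\Big) \\
={} & L\Big(\frac{d}{d_a}(E)\Big).
\end{aligned}
\]

□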

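Lemma 11. Let $E = \mathbb{F}_k(E')$ be an ARE over an alphabet $\Sigma$ and $a$ be a symbol
in $\Sigma$. Let $W_{\mathcal{F}}$ and $X(E')$ be the sets defined by:

\[ W_{\mathcal{F}} = \Big(\bigcup_{b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) = 0} \{b\}\Big)^* \quad \text{and} \quad X(E') = \{E'\} \cup \bigcup_{w \in W_{\mathcal{F}}} \Big\{\frac{d}{d_w}(E')\Big\}. \]

Let us consider the language $L'$ defined by:

\[ L' = \bigcup_{F \in X(E')} L(F) \cup L\Big(\sum_{F \in X(E'),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}\Big(\frac{d}{d_b}(F)\Big)\Big). \]

Then the two following propositions are equivalent:

• $\varepsilon \in L(E)$,

• $k \neq \bot \wedge \varepsilon \in L'$.

Furthermore, this equivalence defines a membership test that halts.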

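Proof. Let $W_{\mathcal{F}} = (\bigcup_{b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) = 0} \{b\})^*$, $X(E') = \{E'\} \cup \bigcup_{w \in W_{\mathcal{F}}} \{\frac{d}{d_w}(E')\}$ and
for any symbols $\alpha$, $\beta$ in $\Sigma \cup \{\varepsilon\}$, let us set $k_{\alpha,\beta} = k - \mathcal{F}((\alpha), (\beta))$. Obviously, $k = \bot \Rightarrow \varepsilon \notin L(E)$. Consequently, if $k \neq \bot$:

\[
\begin{aligned}
\varepsilon \in L(E) &\Leftrightarrow \exists w \in L(E'),\ \mathbb{F}(\varepsilon, w) \in \{0, \ldots, k\} \\
&\Leftrightarrow
\begin{cases}
\exists w \in L(E'),\ \mathbb{F}(\varepsilon, w) = 0 \\
\vee\ \exists b \in \Sigma,\ \exists w_1 b w_2 \in L(E'),\ \mathbb{F}(\varepsilon, w_1) = 0 \wedge \mathcal{F}((\varepsilon),(b)) \neq 0 \wedge \mathbb{F}(\varepsilon, w_2) \leq k_{\varepsilon,b}
\end{cases} \\
&\Leftrightarrow
\begin{cases}
\exists w \in L(E'),\ w \in W_{\mathcal{F}} \\
\vee\ \exists b \in \Sigma,\ \exists w_1 \in W_{\mathcal{F}},\ \exists w_2 \in (w_1 b)^{-1}(L(E')),\ \mathcal{F}((\varepsilon),(b)) \neq 0 \wedge \mathbb{F}(\varepsilon, w_2) \leq k_{\varepsilon,b}
\end{cases} \\
&\Leftrightarrow
\begin{cases}
\exists w \in L(E'),\ \varepsilon \in w^{-1}(W_{\mathcal{F}}) \\
\vee\ \exists b \in \Sigma,\ \exists w_2 \in b^{-1}(\bigcup_{F \in X(E')} L(F)),\ \mathcal{F}((\varepsilon),(b)) \neq 0 \wedge \mathbb{F}(\varepsilon, w_2) \leq k_{\varepsilon,b}
\end{cases} \\
&\Leftrightarrow
\begin{cases}
\varepsilon \in \bigcup_{F \in X(E')} L(F) \\
\vee\ \exists b \in \Sigma,\ \exists w_2 \in L(\sum_{F \in X(E')} \frac{d}{d_b}(F)),\ \mathcal{F}((\varepsilon),(b)) \neq 0 \wedge \mathbb{F}(\varepsilon, w_2) \leq k_{\varepsilon,b}
\end{cases} \\
&\Leftrightarrow \varepsilon \in \bigcup_{F \in X(E')} L(F) \ \vee\ \varepsilon \in L\Big(\sum_{b \in \Sigma,\, F \in X(E'),\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k_{\varepsilon,b}}\Big(\frac{d}{d_b}(F)\Big)\Big)
\end{aligned}
\]

Furthermore, (a) by induction over $E'$, the membership test defined by $\varepsilon \in
\bigcup_{F \in X(E')} L(F)$ halts; (b) by recurrence over $k$, since $k_{\varepsilon,b} < k$ when $\mathcal{F}((\varepsilon), (b)) \neq 0$,
the membership test defined by:

\[ \varepsilon \in L\Big(\sum_{F \in X(E'),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}\Big(\frac{d}{d_b}(F)\Big)\Big) \]

halts. □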

Lemma 9 ensures that the derivative automaton $B'(E)$ of an ARE $E$, computed
from the set $\mathcal{D}_E$ of dissimilar derivatives of $E$ following the classical way,
is a finite recognizer. Lemma 11 ensures that the set of final states can be
computed, since the number of derivatives is finite. Finally, Lemma 10 ensures that
the DFA $B'(E)$ recognizes $L(E)$.

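Definition 11. Let $E$ be an ARE over an alphabet $\Sigma$. The tuple $B'(E) =
(\Sigma, Q, I, F, \delta)$ is defined by:

• $Q = \mathcal{D}_E$,

• $I = \{E\}$,

• $F = \{q \in Q \mid \mathrm{r}(\varepsilon, q) = 1\}$,

• $\forall (q, a) \in Q \times \Sigma$, $\delta(q, a) = \{\frac{d}{d_a}(q)\}$.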

Proposition 3. Let E be an approximate regular expression. Then: 
B'(E) is a deterministic automaton that recognizes L(E). 

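Proof. Let $B'(E) = (\Sigma, Q, I, F, \delta)$. Let $w$ be a word in $\Sigma^*$. Let us show by
recurrence over the length of $w$ that $\delta(E, w) = \{\frac{d}{d_w}(E)\}$.
If $w \in \Sigma$, the proposition is satisfied by definition of $\delta$.

If $w = w'a$ with $w' \in \Sigma^*$ and $a \in \Sigma$, by recurrence hypothesis it holds
$\delta(E, w') = \{\frac{d}{d_{w'}}(E)\}$. By definition of $\delta$:

\[
\begin{aligned}
\delta(E, w'a) &= \delta(\delta(E, w'), a) \\
&= \delta\Big(\Big\{\frac{d}{d_{w'}}(E)\Big\}, a\Big) \\
&= \Big\{\frac{d}{d_{w'a}}(E)\Big\}
\end{aligned}
\]

As a first consequence, since $\mathrm{Card}(I) = 1$, since $\delta$ is a function from $Q \times \Sigma^*$
to $2^Q$, and since for any pair $(q, a)$ in $Q \times \Sigma$, $\mathrm{Card}(\delta(q, a)) = 1$, the tuple
$B'(E)$ is a deterministic automaton. Moreover,

\[
\begin{aligned}
w \in L(B'(E)) &\Leftrightarrow \delta(E, w) \cap F \neq \emptyset \\
&\Leftrightarrow \Big\{\frac{d}{d_w}(E)\Big\} \cap F \neq \emptyset \\
&\Leftrightarrow \mathrm{r}\Big(\varepsilon, \frac{d}{d_w}(E)\Big) = 1 \\
&\Leftrightarrow w \in L(E)
\end{aligned}
\]

□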
For any ARE $E$, the automaton $B'(E)$ is the dissimilar derivative finite
automaton of $E$. Consequently, according to Kleene's theorem, we have the
following corollary.

Corollary 1. The language denoted by any ARE is regular. 



5.3 Antimirov Derivatives for an ARE 

Partial derivatives are defined by means of sets of expressions instead of
expressions and thus lead to the construction of a nondeterministic recognizer. We
now extend partial derivatives to the family of AREs. For convenience, for a set
$\mathcal{E}$ of expressions, let us set $\mathbb{F}_k(\mathcal{E}) = \bigcup_{E \in \mathcal{E}} \{\mathbb{F}_k(E)\}$ and $L(\mathcal{E}) = \bigcup_{E \in \mathcal{E}} L(E)$.

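Definition 12. Let $E = \mathbb{F}_k(E')$ be an ARE over an alphabet $\Sigma$ where $\mathbb{F}$ is
associated with $\mathcal{F}$ and $a$ be a symbol in $\Sigma$. The partial derivative of $E$ w.r.t. $a$
is the set $\frac{\partial}{\partial_a}(E)$ computed as follows:

\[
\begin{aligned}
\frac{\partial}{\partial_a}(E) ={} & \bigcup_{F \in X(E'),\, b \in \Sigma} \mathbb{F}_{k - \mathcal{F}((a),(b))}\Big(\frac{\partial}{\partial_b}(F)\Big) \\
& \cup \bigcup_{F \in X(E')} \mathbb{F}_{k - \mathcal{F}((a),(\varepsilon))}(F) \\
& \cup \frac{\partial}{\partial_a}\Big(\bigcup_{F \in X(E'),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}\Big(\frac{\partial}{\partial_b}(F)\Big)\Big)
\end{aligned}
\]

where $W_{\mathcal{F}} = (\bigcup_{b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) = 0} \{b\})^*$ and $X(E') = \{E'\} \cup \bigcup_{w \in W_{\mathcal{F}}} \frac{\partial}{\partial_w}(E')$.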

Lemma 12. Let $E = \mathbb{F}_k(E')$ be an ARE over an alphabet $\Sigma$ and $a$ be a symbol
in $\Sigma$. Then $L(\frac{\partial}{\partial_a}(E)) = a^{-1}(L(E))$.

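Proof. By induction over the structure of $E$.
According to Lemma 8,

\[
\begin{aligned}
a^{-1}(L(E)) ={} & \bigcup_{L'' \in X(L(E')),\, b \in \Sigma} \mathbb{F}_{k - \mathcal{F}((a),(b))}(b^{-1}(L'')) \\
& \cup \bigcup_{L'' \in X(L(E'))} \mathbb{F}_{k - \mathcal{F}((a),(\varepsilon))}(L'') \\
& \cup a^{-1}\Big(\bigcup_{L'' \in X(L(E')),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}(b^{-1}(L''))\Big)
\end{aligned}
\]

where $X(L(E')) = \{L(E')\} \cup \bigcup_{w \in W_{\mathcal{F}}} \{w^{-1}(L(E'))\}$ with $W_{\mathcal{F}} = (\bigcup_{b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) = 0} \{b\})^*$.
Let $X(E') = \{E'\} \cup \bigcup_{w \in W_{\mathcal{F}}} \frac{\partial}{\partial_w}(E')$.

By induction over $E'$, for any word $w$ in $\Sigma^*$, $w^{-1}(L(E')) = L(\frac{\partial}{\partial_w}(E'))$. As
a consequence:

\[ \bigcup_{L'' \in X(L(E'))} L'' = \bigcup_{E'' \in X(E')} L(E'') \]

and:

\[
\begin{aligned}
a^{-1}(L(E)) ={} & \bigcup_{E'' \in X(E'),\, b \in \Sigma} \mathbb{F}_{k - \mathcal{F}((a),(b))}(b^{-1}(L(E''))) \\
& \cup \bigcup_{E'' \in X(E')} \mathbb{F}_{k - \mathcal{F}((a),(\varepsilon))}(L(E'')) \\
& \cup a^{-1}\Big(\bigcup_{E'' \in X(E'),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}(b^{-1}(L(E'')))\Big)
\end{aligned}
\]

By induction over $E'$, for any derivated term $E''$ of $E'$, it holds
$b^{-1}(L(E'')) = L(\frac{\partial}{\partial_b}(E''))$.

Consequently:

\[
\begin{aligned}
a^{-1}(L(E)) ={} & \bigcup_{E'' \in X(E'),\, b \in \Sigma} \mathbb{F}_{k - \mathcal{F}((a),(b))}\Big(L\Big(\frac{\partial}{\partial_b}(E'')\Big)\Big)
\cup \bigcup_{E'' \in X(E')} \mathbb{F}_{k - \mathcal{F}((a),(\varepsilon))}(L(E'')) \\
& \cup a^{-1}\Big(\bigcup_{E'' \in X(E'),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}\Big(L\Big(\frac{\partial}{\partial_b}(E'')\Big)\Big)\Big) \\
={} & L\Big(\bigcup_{E'' \in X(E'),\, b \in \Sigma} \mathbb{F}_{k - \mathcal{F}((a),(b))}\Big(\frac{\partial}{\partial_b}(E'')\Big)\Big)
\cup L\Big(\bigcup_{E'' \in X(E')} \mathbb{F}_{k - \mathcal{F}((a),(\varepsilon))}(E'')\Big) \\
& \cup a^{-1}\Big(L\Big(\bigcup_{E'' \in X(E'),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}\Big(\frac{\partial}{\partial_b}(E'')\Big)\Big)\Big)
\end{aligned}
\]

Furthermore, by recurrence over $k$, for any $\mathcal{F}((\varepsilon), (b)) > 0$, it holds:

\[ a^{-1}\Big(L\Big(\mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}\Big(\frac{\partial}{\partial_b}(E'')\Big)\Big)\Big) = L\Big(\frac{\partial}{\partial_a}\Big(\mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}\Big(\frac{\partial}{\partial_b}(E'')\Big)\Big)\Big). \]

Finally,

\[
\begin{aligned}
a^{-1}(L(E)) ={} & L\Big(\bigcup_{E'' \in X(E'),\, b \in \Sigma} \mathbb{F}_{k - \mathcal{F}((a),(b))}\Big(\frac{\partial}{\partial_b}(E'')\Big)\Big)
\cup L\Big(\bigcup_{E'' \in X(E')} \mathbb{F}_{k - \mathcal{F}((a),(\varepsilon))}(E'')\Big) \\
& \cup L\Big(\frac{\partial}{\partial_a}\Big(\bigcup_{E'' \in X(E'),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}\Big(\frac{\partial}{\partial_b}(E'')\Big)\Big)\Big)
\end{aligned}
\]

and

\[ a^{-1}(L(E)) = L\Big(\frac{\partial}{\partial_a}(E)\Big). \]

□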

Let $\mathcal{DT}_E$ be the set of derivated terms of an ARE $E$, that is the set of the
elements of all the partial derivatives of $E$.

Lemma 13. Let $E = \mathbb{F}_k(E')$ be an ARE over an alphabet $\Sigma$. Then:

\[ \mathcal{DT}_E \subset \bigcup_{k' \in \{0, \ldots, k\}} \mathbb{F}_{k'}(\mathcal{DT}_{E'}). \]

Moreover, the computation of $\mathcal{DT}_E$ halts.

Proof. Consider that $\mathbb{F}$ is associated with $\mathcal{F}$. Let us define the set $S(E', k) =
\bigcup_{k' \in \{0, \ldots, k\}} \mathbb{F}_{k'}(\mathcal{DT}_{E'})$. Let us show by induction over the structure of $E'$ and by
recurrence over $k$ that $\mathcal{DT}_E \subset S(E', k)$. Since $X(E')$ is a finite set of derivated
terms of $E'$, any subexpression of type $\mathbb{F}_{k - \mathcal{F}((a),(\varepsilon))}(F)$ with $F \in X(E')$ belongs
to $S(E', k)$. Since $X(E')$ is a subset of $\mathcal{DT}_{E'}$, $\frac{\partial}{\partial_b}(F)$ is a set of derivated terms
of $E'$ for any $b$ in $\Sigma$. Consequently, $\bigcup_{F \in X(E'),\, b \in \Sigma} \mathbb{F}_{k - \mathcal{F}((a),(b))}(\frac{\partial}{\partial_b}(F))$ is a
subset of $S(E', k)$. Finally, by recurrence hypothesis, for $k' < k$, any partial
derivative of an expression $\mathbb{F}_{k'}(H)$ is a subset of $S(H, k')$. Consequently, any
partial derivative of $\mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}(\frac{\partial}{\partial_b}(F))$ is included into $\bigcup_{F' \in \frac{\partial}{\partial_b}(F)} S(F', k -
\mathcal{F}((\varepsilon), (b)))$ if $\mathcal{F}((\varepsilon), (b)) \neq 0$. Since $F$ is a derivated term of $E'$, so is any
expression in $\frac{\partial}{\partial_b}(F)$, and since $k - \mathcal{F}((\varepsilon), (b)) < k$, any partial derivative of
$\mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}(\frac{\partial}{\partial_b}(F))$ is a subset of $S(E', k)$. As a consequence, (Fact A) any
derivated term of $E$ w.r.t. a symbol $a$ belongs to $S(E', k)$.

Furthermore, let us show that if $G = \mathbb{F}_{k'}(H)$ is an expression that belongs to
$S(E', k)$, then any partial derivative of $G$ is a subset of $S(E', k)$. According to
Fact A, any partial derivative of an expression $\mathbb{F}_{k'}(H)$ is a subset of $S(H, k')$.
When $H$ is a derivated term of $E'$ and $k' \leq k$, any expression in $S(H, k')$ belongs
to $S(E', k)$. As a consequence, any derivated term of $E$ belongs to $S(E', k)$.

As a conclusion, $\mathcal{DT}_E \subset S(E', k) = \bigcup_{k' \in \{0, \ldots, k\}} \mathbb{F}_{k'}(\mathcal{DT}_{E'})$. Moreover, by
induction over $E'$ and by recurrence over $k$, since any derivated term of an
expression $F$ in $X(E')$ belongs to the finite set of derivated terms of $E'$ the
computation of which halts, and since $k - \mathcal{F}((\varepsilon), (b)) < k$ when $\mathcal{F}((\varepsilon), (b)) \neq 0$,
the computation of $\mathcal{DT}_E$ halts. □



Corollary 2. Let $E = \mathbb{F}_k(E')$ be an ARE over an alphabet $\Sigma$. Then $\mathcal{DT}_E$ is a
finite set of AREs. Furthermore, $\mathrm{Card}(\mathcal{DT}_E) \leq \mathrm{Card}(\mathcal{DT}_{E'}) \times (k + 1)$.

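Lemma 14. Let $E = \mathbb{F}_k(E')$ be an ARE over an alphabet $\Sigma$ and $a$ be a symbol
in $\Sigma$. Let $W_{\mathcal{F}}$ and $X(E')$ be the sets defined by:

\[ W_{\mathcal{F}} = \Big(\bigcup_{b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) = 0} \{b\}\Big)^* \quad \text{and} \quad X(E') = \{E'\} \cup \bigcup_{w \in W_{\mathcal{F}}} \frac{\partial}{\partial_w}(E'). \]

Let $L'$ be the language defined by:

\[ L' = \bigcup_{F \in X(E')} L(F) \cup L\Big(\bigcup_{F \in X(E'),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}\Big(\frac{\partial}{\partial_b}(F)\Big)\Big). \]

Then the two following conditions are equivalent:

• $\varepsilon \in L(E)$,

• $k \neq \bot \wedge \varepsilon \in L'$.

Furthermore, this equivalence defines a membership test that halts.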

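Proof. Let $W_{\mathcal{F}} = (\bigcup_{b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) = 0} \{b\})^*$ and $X(E') = \{E'\} \cup \bigcup_{w \in W_{\mathcal{F}}} \frac{\partial}{\partial_w}(E')$.
For convenience, for any two symbols $\alpha$ and $\beta$ in $\Sigma \cup \{\varepsilon\}$, let us set $k_{\alpha,\beta} =
k - \mathcal{F}((\alpha), (\beta))$. Obviously, if $k = \bot$, $\varepsilon \notin L(E)$. For $k \neq \bot$:

\[
\begin{aligned}
\varepsilon \in L(E) &\Leftrightarrow \exists w \in L(E'),\ \mathbb{F}(\varepsilon, w) \in \{0, \ldots, k\} \\
&\Leftrightarrow
\begin{cases}
\exists w \in L(E'),\ \mathbb{F}(\varepsilon, w) = 0 \\
\vee\ \exists b \in \Sigma,\ \exists w_1 b w_2 \in L(E'),\ \mathbb{F}(\varepsilon, w_1) = 0 \wedge \mathcal{F}((\varepsilon),(b)) \neq 0 \wedge \mathbb{F}(\varepsilon, w_2) \leq k_{\varepsilon,b}
\end{cases} \\
&\Leftrightarrow
\begin{cases}
\exists w \in L(E'),\ w \in W_{\mathcal{F}} \\
\vee\ \exists b \in \Sigma,\ \exists w_1 \in W_{\mathcal{F}},\ \exists w_2 \in (w_1 b)^{-1}(L(E')),\ \mathcal{F}((\varepsilon),(b)) \neq 0 \wedge \mathbb{F}(\varepsilon, w_2) \leq k_{\varepsilon,b}
\end{cases} \\
&\Leftrightarrow
\begin{cases}
\exists w \in L(E'),\ \varepsilon \in w^{-1}(W_{\mathcal{F}}) \\
\vee\ \exists b \in \Sigma,\ \exists w_2 \in b^{-1}(\bigcup_{F \in X(E')} L(F)),\ \mathcal{F}((\varepsilon),(b)) \neq 0 \wedge \mathbb{F}(\varepsilon, w_2) \leq k_{\varepsilon,b}
\end{cases} \\
&\Leftrightarrow
\begin{cases}
\varepsilon \in \bigcup_{F \in X(E')} L(F) \\
\vee\ \exists b \in \Sigma,\ \exists w_2 \in L(\bigcup_{F \in X(E')} \frac{\partial}{\partial_b}(F)),\ \mathcal{F}((\varepsilon),(b)) \neq 0 \wedge \mathbb{F}(\varepsilon, w_2) \leq k_{\varepsilon,b}
\end{cases} \\
&\Leftrightarrow \varepsilon \in \bigcup_{F \in X(E')} L(F) \ \vee\ \varepsilon \in L\Big(\bigcup_{b \in \Sigma,\, F \in X(E'),\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k_{\varepsilon,b}}\Big(\frac{\partial}{\partial_b}(F)\Big)\Big)
\end{aligned}
\]

Furthermore, (a) by induction over $E'$, the membership test defined by $\varepsilon \in
\bigcup_{F \in X(E')} L(F)$ halts; (b) by recurrence over $k$, since $k_{\varepsilon,b} < k$ when $\mathcal{F}((\varepsilon), (b)) \neq 0$,
the membership test defined by

\[ \varepsilon \in L\Big(\bigcup_{F \in X(E'),\, b \in \Sigma,\, \mathcal{F}((\varepsilon),(b)) \neq 0} \mathbb{F}_{k - \mathcal{F}((\varepsilon),(b))}\Big(\frac{\partial}{\partial_b}(F)\Big)\Big) \]

halts. □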



Corollary 2 ensures that the derivated term automaton $A(E)$ of an ARE
$E$, computed from the set $\mathcal{DT}_E$ of derivated terms of $E$ following the classical
way, is a finite recognizer. Lemma 14 ensures that the set of final states can be
computed. Finally, Lemma 12 ensures that the NFA $A(E)$ recognizes $L(E)$.

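Definition 13. Let $E$ be an ARE over an alphabet $\Sigma$. The tuple $A(E) =
(\Sigma, Q, I, F, \delta)$ is defined by:

• $Q = \mathcal{DT}_E$,

• $I = \{E\}$,

• $F = \{q \in Q \mid \mathrm{r}(\varepsilon, q) = 1\}$,

• $\forall (q, a) \in Q \times \Sigma$, $\delta(q, a) = \frac{\partial}{\partial_a}(q)$.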

Proposition 4. Let E be an approximate regular expression. Then: 
A(E) is a finite automaton that recognizes L{E). 

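Proof. Let $A(E) = (\Sigma, Q, I, F, \delta)$. Let $w$ be a word in $\Sigma^*$. Let us show by
recurrence over the length of $w$ that $\delta(E, w) = \frac{\partial}{\partial_w}(E)$.
If $w \in \Sigma$, the proposition is satisfied by definition of $\delta$.

If $w = w'a$ with $w' \in \Sigma^*$ and $a \in \Sigma$, by recurrence hypothesis it holds
$\delta(E, w') = \frac{\partial}{\partial_{w'}}(E)$. By definition of $\delta$:

\[
\begin{aligned}
\delta(E, w'a) &= \delta(\delta(E, w'), a) \\
&= \delta\Big(\frac{\partial}{\partial_{w'}}(E), a\Big) \\
&= \bigcup_{E' \in \frac{\partial}{\partial_{w'}}(E)} \delta(E', a)
\end{aligned}
\]

Consequently,

\[
\begin{aligned}
w \in L(A(E)) &\Leftrightarrow \delta(E, w'a) \cap F \neq \emptyset \\
&\Leftrightarrow \exists E' \in \frac{\partial}{\partial_w}(E),\ E' \in F \\
&\Leftrightarrow \exists E' \in \frac{\partial}{\partial_w}(E) \mid \mathrm{r}(\varepsilon, E') = 1 \\
&\Leftrightarrow \varepsilon \in \bigcup_{E' \in \frac{\partial}{\partial_w}(E)} L(E') \\
&\Leftrightarrow \varepsilon \in w^{-1}(L(E))
\end{aligned}
\]

□

For any ARE $E$, the automaton $A(E)$ is called the derivated term finite
automaton of $E$.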
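The construction of Definition 13 does not depend on how partial derivatives are computed: it only needs a function returning $\frac{\partial}{\partial_a}(q)$ and a test for $\mathrm{r}(\varepsilon, q) = 1$. The following Python sketch makes this explicit. It is a generic worklist construction written for illustration; the parameters `partial_derivative` and `nullable` are placeholders to be supplied by the caller, and the toy instance at the end (states are plain strings denoting single words) is hypothetical.

```python
def build_nfa(expr, sigma, partial_derivative, nullable):
    """Derivated term automaton of `expr`, in the spirit of Definition 13.

    partial_derivative(q, a) -> iterable of states (assumed finite overall)
    nullable(q)              -> True iff the empty word belongs to L(q)
    Returns (states, initial states, final states, transition dict).
    """
    states, todo, delta = {expr}, [expr], {}
    while todo:
        q = todo.pop()
        for a in sigma:
            targets = frozenset(partial_derivative(q, a))
            delta[(q, a)] = targets
            for t in targets - states:
                states.add(t)
                todo.append(t)
    finals = {q for q in states if nullable(q)}
    return states, {expr}, finals, delta

def accepts(nfa, word):
    """Run the NFA on `word` by tracking the set of current states."""
    _, initial, finals, delta = nfa
    current = set(initial)
    for a in word:
        current = set().union(*(delta.get((q, a), frozenset()) for q in current))
    return bool(current & finals)

# Toy instance: a state is a string s denoting the language {s}.
def toy_pd(s, a):
    return {s[1:]} if s[:1] == a else set()

def toy_nullable(s):
    return s == ""

toy_nfa = build_nfa("ab", "ab", toy_pd, toy_nullable)
```

Termination of `build_nfa` relies on the finiteness of the set of derivated terms (Corollary 2); with a derivative function that keeps producing new expressions, the loop would not halt.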
5.4 Back to Hamming and Levenshtein Derivation

This subsection is devoted to showing the link between HLARE derivation formulae
and ARE ones. Given an HLARE $E$ and a word $w$, the following proposition
illustrates the fact that the expression $D'_w(E)$ (resp. the set of
expressions $\Delta_w(E)$) defined for HLAREs in Section 4 and the expression $\frac{d}{d_w}(E)$ of Definition 10
(resp. the set of expressions $\frac{\partial}{\partial_w}(E)$ of Definition 12) are syntactically equal up
to the expression $\emptyset$.

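Proposition 5. Let $E$ be an HLARE over an alphabet $\Sigma$. For any symbol $a$ in
$\Sigma$, the two following conditions are satisfied:

• $\frac{d}{d_a}(E) \in \{D'_a(E) + \emptyset, D'_a(E)\}$,

• $\frac{\partial}{\partial_a}(E) \in \{\Delta_a(E) \cup \{\emptyset\}, \Delta_a(E)\}$.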

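Proof. We prove the first membership relation. A similar proof can be given for
the second one.

By induction over the structure of an HLARE.

1. If $E = a \in \Sigma$, $E = E_1 + E_2$, $E = E_1 \cdot E_2$ or if $E = E_1^*$, the proposition is
trivially checked by similarity of the formulae.

2. If $E = \mathbb{H}_k(E')$, by definition of $\frac{d}{d_a}(E)$:

\[
\begin{aligned}
\frac{d}{d_a}(E) ={} & \sum_{F \in X(E'),\, b \in \Sigma} \mathbb{H}_{k - \mathcal{H}((a),(b))}\Big(\frac{d}{d_b}(F)\Big) \\
& + \sum_{F \in X(E')} \mathbb{H}_{k - \mathcal{H}((a),(\varepsilon))}(F) \\
& + \frac{d}{d_a}\Big(\sum_{F \in X(E'),\, b \in \Sigma,\, \mathcal{H}((\varepsilon),(b)) \neq 0} \mathbb{H}_{k - \mathcal{H}((\varepsilon),(b))}\Big(\frac{d}{d_b}(F)\Big)\Big)
\end{aligned}
\]

where $X(E') = \{E'\} \cup \bigcup_{w \in W_{\mathcal{H}}} \{\frac{d}{d_w}(E')\}$ with $W_{\mathcal{H}} = (\bigcup_{b \in \Sigma,\, \mathcal{H}((\varepsilon),(b)) = 0} \{b\})^*$.

By definition of $\mathcal{H}$, $X(E') = \{E'\}$, $\mathcal{H}((a), (b)) \in \{1, 0\}$ and $\mathcal{H}((a), (\varepsilon)) = \mathcal{H}((\varepsilon), (a)) = \bot$
for any two symbols $a$ and $b$ in $\Sigma$.

Consequently:

\[
\begin{aligned}
\frac{d}{d_a}(E) ={} & \sum_{b \in \Sigma} \mathbb{H}_{k - \mathcal{H}((a),(b))}\Big(\frac{d}{d_b}(E')\Big)
+ \sum_{F \in X(E')} \mathbb{H}_{\bot}(F)
+ \frac{d}{d_a}\Big(\sum_{F \in X(E'),\, b \in \Sigma,\, \mathcal{H}((\varepsilon),(b)) \neq 0} \mathbb{H}_{\bot}\Big(\frac{d}{d_b}(F)\Big)\Big) \\
={} & \mathbb{H}_k\Big(\frac{d}{d_a}(E')\Big) + \sum_{b \in \Sigma \setminus \{a\}} \mathbb{H}_{k-1}\Big(\frac{d}{d_b}(E')\Big) + \emptyset \\
={} & \mathbb{H}_k(D'_a(E')) + \sum_{b \in \Sigma \setminus \{a\}} \mathbb{H}_{k-1}(D'_b(E')) + \emptyset \\
\in{} & \{D'_a(E) + \emptyset, D'_a(E)\}.
\end{aligned}
\]

3. If $E = \mathbb{L}_k(E')$, by definition of $\frac{d}{d_a}(E)$:

\[
\begin{aligned}
\frac{d}{d_a}(E) ={} & \sum_{F \in X(E'),\, b \in \Sigma} \mathbb{L}_{k - \mathcal{L}((a),(b))}\Big(\frac{d}{d_b}(F)\Big) \\
& + \sum_{F \in X(E')} \mathbb{L}_{k - \mathcal{L}((a),(\varepsilon))}(F) \\
& + \frac{d}{d_a}\Big(\sum_{F \in X(E'),\, b \in \Sigma,\, \mathcal{L}((\varepsilon),(b)) \neq 0} \mathbb{L}_{k - \mathcal{L}((\varepsilon),(b))}\Big(\frac{d}{d_b}(F)\Big)\Big)
\end{aligned}
\]

where $X(E') = \{E'\} \cup \bigcup_{w \in W_{\mathcal{L}}} \{\frac{d}{d_w}(E')\}$ with $W_{\mathcal{L}} = (\bigcup_{b \in \Sigma,\, \mathcal{L}((\varepsilon),(b)) = 0} \{b\})^*$.

By definition of $\mathcal{L}$, $X(E') = \{E'\}$, $\mathcal{L}((a), (b)) \in \{1, 0\}$ and $\mathcal{L}((a), (\varepsilon)) =
\mathcal{L}((\varepsilon), (a)) = 1$ for any two symbols $a$ and $b$ in $\Sigma$.

Consequently:

\[
\begin{aligned}
\frac{d}{d_a}(E) ={} & \sum_{b \in \Sigma} \mathbb{L}_{k - \mathcal{L}((a),(b))}\Big(\frac{d}{d_b}(E')\Big) + \mathbb{L}_{k-1}(E') + \frac{d}{d_a}\Big(\sum_{b \in \Sigma} \mathbb{L}_{k-1}\Big(\frac{d}{d_b}(E')\Big)\Big) \\
={} & \mathbb{L}_k\Big(\frac{d}{d_a}(E')\Big) + \sum_{b \in \Sigma \setminus \{a\}} \mathbb{L}_{k-1}\Big(\frac{d}{d_b}(E')\Big) + \mathbb{L}_{k-1}(E') + \frac{d}{d_a}\Big(\sum_{b \in \Sigma} \mathbb{L}_{k-1}\Big(\frac{d}{d_b}(E')\Big)\Big)
\end{aligned}
\]

Finally, by induction hypothesis and by recurrence over $k$,

\[
\begin{aligned}
\frac{d}{d_a}(E) ={} & \mathbb{L}_k(D'_a(E')) + \sum_{b \in \Sigma \setminus \{a\}} \mathbb{L}_{k-1}(D'_b(E')) + \mathbb{L}_{k-1}(E') + D'_a\Big(\sum_{b \in \Sigma} \mathbb{L}_{k-1}(D'_b(E'))\Big) \\
={} & D'_a(E).
\end{aligned}
\]

□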



As a corollary of Proposition 5, the proofs of the lemmas and propositions
of Section 4 can be deduced from the corresponding ones of Section 5.



6 Conclusion 

The similarity operators that equip the family of approximate regular
expressions make AREs a convenient tool for approximate regular expression
matching. The extension of dissimilar derivatives and partial derivatives to the
family of AREs allows us to provide a syntactical solution to the approximate
membership problem; moreover, in each case the set of derivatives is finite, and
thus this extension also yields the construction of a recognizer. An additional
advantage of similarity operators is that they can be combined with other
regular operators, such as the intersection and complementation operators, in order
to produce even smaller expressions.